Airbnb, a global online marketplace headquartered in San Francisco, California, was founded in 2008 and operates in over 191 countries. It enables individuals to monetize their spaces by renting them out to travelers.
Through Airbnb’s platform, hosts can offer guests short-term lodging and tourism-related activities. Guests have the flexibility to search for accommodations based on type, dates, location, and price, and can even seek out specific types of homes, such as bed and breakfasts, unique homes, and vacation homes.
Known for its diverse accommodations, Airbnb boasts a variety of properties ranging from single rooms and apartments to moored yachts, houseboats, entire islands, and even castles.
By offering unique and personalized experiences, Airbnb has transformed the travel industry. A prime example of this success is Melbourne, Australia.
This city, celebrated for its cultural diversity, blend of modern and historical architecture, lively arts scene, and passion for good food and coffee, perfectly embodies Airbnb’s ethos of local experiences and personal connections. The wide variety of unique accommodations available in Melbourne further enhances its appeal to travelers.
In light of this, I intend to conduct an analysis of Melbourne Airbnb Data. This dataset contains information about hosts, neighbourhoods, room types, prices, reviews, ratings, etc. Through the Inside Airbnb website, it is possible to collect this dataset and you can even find Airbnb data for other cities around the world.
This analysis aims to explore the data, perform necessary data cleaning steps to prepare it for analysis, and then delve into the data to reveal trends and patterns. The goal is to provide valuable insights into how Airbnb is utilized in Melbourne and understand its implications for hosts, guests, and the wider community.
With that in mind, I have defined the problem statement as the following: Analyze the key characteristics of Airbnb listings in Melbourne to gain a comprehensive understanding of the local rental market.
In alignement with the problem statement, the following research questions have been formulated:
Who are the top 10 hosts with the most listings?
Where are the top 10 Melbourne Airbnbs listings located?
How are the room type distributed in the listings?
What are the 5 most common numbers of bedrooms in the listings?
What are the 5 most frequent minimum nights?
What are the four amenities most commonly provided?
Which room type in Melbourne's Airbnb listings has the highest number of reviews?
What is the average of amenities by room type?
Is the average price per night higher for private room than for shared rooms?
Properties with more reviews might be priced higher due to their popularity?
Listings with more amenities might be priced higher?
I'm about to embark on a journey of data exploration and analysis. To aid me in this quest, I've gathered a set of powerful tools, each with its own unique capabilities. With these tools at my disposal, I'm ready to dive into the world of data and uncover the answers it holds.
# Import the libraries
import pandas as pd # Used for data manipulation and analysis
import numpy as np # Handles numerical operations like mathematical computations
import matplotlib.pyplot as plt # Creates static, animated, and intereactive visualizations
import seaborn as sns # Makes statistical graphics
import plotly.express as px # Generate interactive plots
import plotly.graph_objects as go # Generate interactive plots
import statistics as st # Provides functions to perform statistical operations
import geopandas as gpd # Deals with geospatial data
import ast # Allows Python code to be parsed into an abstract syntax tree (AST)
from collections import Counter # Keeps track of the frequency of elements in a collection
import warnings # Used to warn the user about situations that aren't necessarily exceptions
This code is setting up the environment for data analysis and visualization. These settings help in making the data analysis process more efficient and the output more readable.
# Set up the pandas
pd.set_option('display.max_columns', None) # Display all columns in DataFrame.
pd.set_option('display.max_rows', None) # Display all rows in DataFrame.
# Set up the matplotlib
plt.style.use('ggplot') # Apply 'ggplot' style to plots.
# Set up warnings
warnings.filterwarnings('ignore') # Ignore all warning messages.
I'm going to import the data from a CSV file into a DataFrame, preparing it for our upcoming analysis.
# Import DataFrame
data = pd.read_csv("/kaggle/input/melbourne-airbnb-september-2023-dataset/listings.csv")
This stage involves three main steps:
Understanding the DataFrame: This involves getting to know the structure of the DataFrame, its columns (which represent the variables or features in the data), and the type of data each column holds.
Descriptive Statistics: This is where we start to dive deeper into the data. We calculate summary statistics like mean, median, mode, standard deviation etc., for each feature in the DataFrame. This gives us a sense of the distribution of values and can help identify trends and patterns.
Data Quality Assessment: In this step, we check the quality of our data. We look for missing values, outliers, or inconsistent data entries, and decide to handle them. This ensures that our analysis is based on clean, reliable data.
These steps provide a solid foundation for any data analysis task. They help us understand what our data looks like and ensure it's in good shape for further analysis.
Using the .shape attribute we can see a tuple representing the dimensionality of the DataFrame. The first element of the tuple will give you the number of rows (23185) and the second element will give you the number of columns (75). This is a quick way to know how large your dataset is.
# Check the DataFrame dimension
data.shape
(23185, 75)
The .dtypes attribute returns a Series with the data type of each column. The result will be in the format column_name: data_type. This is useful to know because the data type of the columns can affect what operations you can perform on them.
# Check data types
data.dtypes
id int64 listing_url object scrape_id int64 last_scraped object source object name object description object neighborhood_overview object picture_url object host_id int64 host_url object host_name object host_since object host_location object host_about object host_response_time object host_response_rate object host_acceptance_rate object host_is_superhost object host_thumbnail_url object host_picture_url object host_neighbourhood object host_listings_count float64 host_total_listings_count float64 host_verifications object host_has_profile_pic object host_identity_verified object neighbourhood object neighbourhood_cleansed object neighbourhood_group_cleansed float64 latitude float64 longitude float64 property_type object room_type object accommodates int64 bathrooms float64 bathrooms_text object bedrooms float64 beds float64 amenities object price object minimum_nights int64 maximum_nights int64 minimum_minimum_nights int64 maximum_minimum_nights int64 minimum_maximum_nights int64 maximum_maximum_nights int64 minimum_nights_avg_ntm float64 maximum_nights_avg_ntm float64 calendar_updated float64 has_availability object availability_30 int64 availability_60 int64 availability_90 int64 availability_365 int64 calendar_last_scraped object number_of_reviews int64 number_of_reviews_ltm int64 number_of_reviews_l30d int64 first_review object last_review object review_scores_rating float64 review_scores_accuracy float64 review_scores_cleanliness float64 review_scores_checkin float64 review_scores_communication float64 review_scores_location float64 review_scores_value float64 license float64 instant_bookable object calculated_host_listings_count int64 calculated_host_listings_count_entire_homes int64 calculated_host_listings_count_private_rooms int64 calculated_host_listings_count_shared_rooms int64 reviews_per_month float64 dtype: object
This line of code is using the .head() method to display, in this case, the first 3 rows of the DataFrame 'data'. This is a quick way to preview the data and get a sense of what it looks like.
# Preview the first 3 rows
data.head(3)
| id | listing_url | scrape_id | last_scraped | source | name | description | neighborhood_overview | picture_url | host_id | host_url | host_name | host_since | host_location | host_about | host_response_time | host_response_rate | host_acceptance_rate | host_is_superhost | host_thumbnail_url | host_picture_url | host_neighbourhood | host_listings_count | host_total_listings_count | host_verifications | host_has_profile_pic | host_identity_verified | neighbourhood | neighbourhood_cleansed | neighbourhood_group_cleansed | latitude | longitude | property_type | room_type | accommodates | bathrooms | bathrooms_text | bedrooms | beds | amenities | price | minimum_nights | maximum_nights | minimum_minimum_nights | maximum_minimum_nights | minimum_maximum_nights | maximum_maximum_nights | minimum_nights_avg_ntm | maximum_nights_avg_ntm | calendar_updated | has_availability | availability_30 | availability_60 | availability_90 | availability_365 | calendar_last_scraped | number_of_reviews | number_of_reviews_ltm | number_of_reviews_l30d | first_review | last_review | review_scores_rating | review_scores_accuracy | review_scores_cleanliness | review_scores_checkin | review_scores_communication | review_scores_location | review_scores_value | license | instant_bookable | calculated_host_listings_count | calculated_host_listings_count_entire_homes | calculated_host_listings_count_private_rooms | calculated_host_listings_count_shared_rooms | reviews_per_month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10803 | https://www.airbnb.com/rooms/10803 | 20230904181306 | 2023-09-05 | city scrape | Rental unit in Brunswick East · ★4.49 · 1 bedr... | A large air conditioned room with firm queen s... | This hip area is a crossroads between two grea... | https://a0.muscache.com/pictures/e5f30dd1-ac57... | 38901 | https://www.airbnb.com/users/show/38901 | Lindsay | 2009-09-16 | Melbourne, Australia | As an artist working in animation and video I ... | within a few hours | 100% | 95% | f | https://a0.muscache.com/im/pictures/user/6860c... | https://a0.muscache.com/im/pictures/user/6860c... | Brunswick | 1.0 | 1.0 | ['email', 'phone'] | t | t | Brunswick East, Victoria, Australia | Moreland | NaN | -37.76606 | 144.97951 | Private room in rental unit | Private room | 2 | NaN | 1 shared bath | NaN | 1.0 | ["Microwave", "Hot water kettle", "Drying rack... | $49.00 | 5 | 14 | 1 | 5 | 14 | 14 | 5.0 | 14.0 | NaN | t | 4 | 13 | 35 | 201 | 2023-09-05 | 173 | 21 | 0 | 2013-01-12 | 2023-07-19 | 4.49 | 4.65 | 3.98 | 4.72 | 4.69 | 4.66 | 4.61 | NaN | f | 1 | 0 | 1 | 0 | 1.33 |
| 1 | 12936 | https://www.airbnb.com/rooms/12936 | 20230904181306 | 2023-09-05 | previous scrape | Rental unit in St Kilda · ★4.68 · 1 bedroom · ... | RIGHT IN THE HEART OF ST KILDA! It doesn't get... | A stay at our apartment means you can enjoy so... | https://a0.muscache.com/pictures/59701/2e8cdaf... | 50121 | https://www.airbnb.com/users/show/50121 | The A2C Team | 2009-10-31 | Melbourne, Australia | NaN | NaN | NaN | NaN | f | https://a0.muscache.com/im/pictures/user/3894a... | https://a0.muscache.com/im/pictures/user/3894a... | St Kilda | 10.0 | 20.0 | ['email', 'phone', 'work_email'] | t | t | St Kilda, Victoria, Australia | Port Phillip | NaN | -37.85999 | 144.97662 | Entire rental unit | Entire home/apt | 2 | NaN | 1 bath | 1.0 | 1.0 | ["Microwave", "Lockbox", "Hot water", "Smoke a... | $95.00 | 3 | 14 | 3 | 3 | 14 | 14 | 3.0 | 14.0 | NaN | t | 0 | 0 | 0 | 0 | 2023-09-05 | 42 | 0 | 0 | 2010-08-04 | 2020-03-15 | 4.68 | 4.78 | 4.71 | 4.83 | 4.83 | 4.78 | 4.66 | NaN | f | 10 | 10 | 0 | 0 | 0.26 |
| 2 | 38271 | https://www.airbnb.com/rooms/38271 | 20230904181306 | 2023-09-04 | city scrape | Rental unit in Berwick · ★4.86 · 3 bedrooms · ... | No children under 13 will be accepted in your ... | Our street is quiet & secluded but within walk... | https://a0.muscache.com/pictures/1182791/3bf4b... | 164193 | https://www.airbnb.com/users/show/164193 | Daryl & Dee | 2010-07-12 | Berwick, Australia | We are an active couple who work from home and... | within an hour | 100% | 91% | t | https://a0.muscache.com/im/pictures/user/29c38... | https://a0.muscache.com/im/pictures/user/29c38... | NaN | 1.0 | 1.0 | ['email', 'phone', 'work_email'] | t | t | Berwick, Victoria, Australia | Casey | NaN | -38.05723 | 145.33982 | Entire rental unit | Entire home/apt | 5 | NaN | 1 bath | 3.0 | 3.0 | ["Microwave", "Hot water kettle", "Drying rack... | $116.00 | 1 | 14 | 1 | 1 | 14 | 14 | 1.0 | 14.0 | NaN | t | 18 | 40 | 64 | 325 | 2023-09-04 | 228 | 36 | 3 | 2010-11-24 | 2023-08-26 | 4.86 | 4.92 | 4.98 | 4.91 | 4.94 | 4.90 | 4.88 | NaN | f | 1 | 1 | 0 | 0 | 1.47 |
Here I use .tail() method to display the last 3 rows of the DataFrame 'data'. This is a quick way to see the most recent entries in your data.
# Preview the last 3 rows
data.tail(3)
| id | listing_url | scrape_id | last_scraped | source | name | description | neighborhood_overview | picture_url | host_id | host_url | host_name | host_since | host_location | host_about | host_response_time | host_response_rate | host_acceptance_rate | host_is_superhost | host_thumbnail_url | host_picture_url | host_neighbourhood | host_listings_count | host_total_listings_count | host_verifications | host_has_profile_pic | host_identity_verified | neighbourhood | neighbourhood_cleansed | neighbourhood_group_cleansed | latitude | longitude | property_type | room_type | accommodates | bathrooms | bathrooms_text | bedrooms | beds | amenities | price | minimum_nights | maximum_nights | minimum_minimum_nights | maximum_minimum_nights | minimum_maximum_nights | maximum_maximum_nights | minimum_nights_avg_ntm | maximum_nights_avg_ntm | calendar_updated | has_availability | availability_30 | availability_60 | availability_90 | availability_365 | calendar_last_scraped | number_of_reviews | number_of_reviews_ltm | number_of_reviews_l30d | first_review | last_review | review_scores_rating | review_scores_accuracy | review_scores_cleanliness | review_scores_checkin | review_scores_communication | review_scores_location | review_scores_value | license | instant_bookable | calculated_host_listings_count | calculated_host_listings_count_entire_homes | calculated_host_listings_count_private_rooms | calculated_host_listings_count_shared_rooms | reviews_per_month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 23182 | 971613881418926837 | https://www.airbnb.com/rooms/971613881418926837 | 20230904181306 | 2023-09-05 | city scrape | Home in Abbotsford · ★New · 4 bedrooms · 4 bed... | Cả nhóm sẽ dễ dàng tiếp cận mọi địa điểm từ ch... | NaN | https://a0.muscache.com/pictures/hosting/Hosti... | 535127590 | https://www.airbnb.com/users/show/535127590 | Trần | 2023-09-02 | Abbotsford, Australia | NaN | NaN | NaN | NaN | f | https://a0.muscache.com/im/pictures/user/User-... | https://a0.muscache.com/im/pictures/user/User-... | NaN | 1.0 | 1.0 | ['email', 'phone'] | t | t | NaN | Yarra | NaN | -37.807580 | 145.000010 | Entire home | Entire home/apt | 3 | NaN | 1 bath | 4.0 | 4.0 | ["Wifi", "Kitchen", "Free parking on premises"... | $214.00 | 1 | 365 | 1 | 1 | 365 | 365 | 1.0 | 365.0 | NaN | f | 29 | 59 | 89 | 269 | 2023-09-05 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | f | 1 | 1 | 0 | 0 | NaN |
| 23183 | 971616602219608699 | https://www.airbnb.com/rooms/971616602219608699 | 20230904181306 | 2023-09-04 | city scrape | Rental unit in Box Hill · ★New · 2 bedrooms · ... | Enjoy easy access to everything from this perf... | NaN | https://a0.muscache.com/pictures/hosting/Hosti... | 163554327 | https://www.airbnb.com/users/show/163554327 | Steven | 2017-12-21 | Box Hill South, Australia | NaN | within an hour | 100% | 99% | f | https://a0.muscache.com/im/pictures/user/70c58... | https://a0.muscache.com/im/pictures/user/70c58... | NaN | 12.0 | 21.0 | ['email', 'phone'] | t | t | NaN | Whitehorse | NaN | -37.818577 | 145.116941 | Entire rental unit | Entire home/apt | 4 | NaN | 2 baths | 2.0 | 3.0 | ["Pool", "Wifi", "Kitchen", "Hot tub", "BBQ gr... | $118.00 | 1 | 365 | 1 | 1 | 365 | 365 | 1.0 | 365.0 | NaN | t | 30 | 60 | 90 | 179 | 2023-09-04 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | t | 11 | 11 | 0 | 0 | NaN |
| 23184 | 971632586652679242 | https://www.airbnb.com/rooms/971632586652679242 | 20230904181306 | 2023-09-04 | city scrape | Home in Springvale · ★New · Studio · 1 bed · 0... | The room is spacious with built in robe and a ... | NaN | https://a0.muscache.com/pictures/hosting/Hosti... | 395737294 | https://www.airbnb.com/users/show/395737294 | Ratana | 2021-04-06 | Springvale, Australia | I am a professional engineer, running my own e... | within an hour | 100% | 63% | f | https://a0.muscache.com/im/pictures/user/c3e11... | https://a0.muscache.com/im/pictures/user/c3e11... | NaN | 3.0 | 3.0 | ['email', 'phone'] | t | t | NaN | Greater Dandenong | NaN | -37.951797 | 145.157050 | Private room in home | Private room | 2 | NaN | 0 shared baths | NaN | 1.0 | ["Security cameras on property", "Smoke alarm"... | $80.00 | 1 | 365 | 1 | 1 | 365 | 365 | 1.0 | 365.0 | NaN | t | 4 | 34 | 64 | 339 | 2023-09-04 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | f | 3 | 0 | 3 | 0 | NaN |
Using .describe() method generates a summary statistics of the DataFrame. This method returns count, mean, standard deviation, minimum and maximum values, and the quartile of the data for each column. It's a quick way to understand the distribution of each column in your data.
# Check summary statistics
data.describe()
| id | scrape_id | host_id | host_listings_count | host_total_listings_count | neighbourhood_group_cleansed | latitude | longitude | accommodates | bathrooms | bedrooms | beds | minimum_nights | maximum_nights | minimum_minimum_nights | maximum_minimum_nights | minimum_maximum_nights | maximum_maximum_nights | minimum_nights_avg_ntm | maximum_nights_avg_ntm | calendar_updated | availability_30 | availability_60 | availability_90 | availability_365 | number_of_reviews | number_of_reviews_ltm | number_of_reviews_l30d | review_scores_rating | review_scores_accuracy | review_scores_cleanliness | review_scores_checkin | review_scores_communication | review_scores_location | review_scores_value | license | calculated_host_listings_count | calculated_host_listings_count_entire_homes | calculated_host_listings_count_private_rooms | calculated_host_listings_count_shared_rooms | reviews_per_month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2.318500e+04 | 2.318500e+04 | 2.318500e+04 | 23183.000000 | 23183.000000 | 0.0 | 23185.000000 | 23185.000000 | 23185.000000 | 0.0 | 16888.000000 | 23002.000000 | 23185.000000 | 23185.000000 | 23185.000000 | 23185.000000 | 2.318500e+04 | 2.318500e+04 | 23185.000000 | 2.318500e+04 | 0.0 | 23185.000000 | 23185.000000 | 23185.000000 | 23185.000000 | 23185.000000 | 23185.000000 | 23185.000000 | 18675.000000 | 18429.000000 | 18433.000000 | 18427.000000 | 18432.000000 | 18427.000000 | 18427.000000 | 0.0 | 23185.000000 | 23185.000000 | 23185.000000 | 23185.000000 | 18675.000000 |
| mean | 3.472407e+17 | 2.023090e+13 | 1.607668e+08 | 28.485140 | 49.072510 | NaN | -37.828206 | 145.013870 | 3.659521 | NaN | 1.974420 | 2.127858 | 5.305499 | 560.395212 | 4.851369 | 6.089411 | 1.019553e+06 | 1.019580e+06 | 5.466875 | 1.019567e+06 | NaN | 11.431011 | 25.594609 | 40.532327 | 137.641320 | 30.803666 | 9.655769 | 0.716066 | 4.643992 | 4.748298 | 4.663471 | 4.779708 | 4.811204 | 4.824680 | 4.659304 | NaN | 17.263274 | 12.863403 | 4.246927 | 0.101100 | 1.370684 |
| std | 4.043073e+17 | 4.847761e+00 | 1.585790e+08 | 101.397683 | 199.137068 | NaN | 0.074543 | 0.150682 | 2.357757 | NaN | 1.076814 | 1.759124 | 29.116198 | 806.061758 | 25.774833 | 28.565941 | 4.676585e+07 | 4.676585e+07 | 26.640337 | 4.676585e+07 | NaN | 11.490477 | 23.304951 | 35.290262 | 131.909717 | 60.806538 | 17.740837 | 1.465085 | 0.696825 | 0.440869 | 0.493951 | 0.415675 | 0.410474 | 0.332126 | 0.470684 | NaN | 49.074356 | 33.107832 | 29.576002 | 0.967221 | 1.633770 |
| min | 1.080300e+04 | 2.023090e+13 | 7.834000e+03 | 1.000000 | 1.000000 | NaN | -38.225690 | 144.518047 | 1.000000 | NaN | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000e+00 | 1.000000e+00 | 1.000000 | 1.000000e+00 | NaN | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | NaN | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.010000 |
| 25% | 2.359906e+07 | 2.023090e+13 | 3.039392e+07 | 1.000000 | 1.000000 | NaN | -37.854080 | 144.956622 | 2.000000 | NaN | 1.000000 | 1.000000 | 1.000000 | 90.000000 | 1.000000 | 2.000000 | 1.800000e+02 | 3.650000e+02 | 1.000000 | 3.650000e+02 | NaN | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 4.600000 | 4.700000 | 4.550000 | 4.740000 | 4.800000 | 4.790000 | 4.570000 | NaN | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.240000 |
| 50% | 4.792597e+07 | 2.023090e+13 | 9.879526e+07 | 2.000000 | 3.000000 | NaN | -37.819440 | 144.977240 | 3.000000 | NaN | 2.000000 | 2.000000 | 2.000000 | 365.000000 | 2.000000 | 2.000000 | 1.125000e+03 | 1.125000e+03 | 2.000000 | 1.125000e+03 | NaN | 8.000000 | 25.000000 | 42.000000 | 91.000000 | 7.000000 | 2.000000 | 0.000000 | 4.830000 | 4.890000 | 4.820000 | 4.920000 | 4.950000 | 4.920000 | 4.780000 | NaN | 2.000000 | 1.000000 | 0.000000 | 0.000000 | 0.880000 |
| 75% | 7.894182e+17 | 2.023090e+13 | 2.628817e+08 | 10.000000 | 15.000000 | NaN | -37.802090 | 145.022480 | 4.000000 | NaN | 2.000000 | 3.000000 | 3.000000 | 1125.000000 | 3.000000 | 5.000000 | 1.125000e+03 | 1.125000e+03 | 3.700000 | 1.125000e+03 | NaN | 23.000000 | 49.000000 | 75.000000 | 267.000000 | 32.000000 | 13.000000 | 1.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 4.940000 | NaN | 8.000000 | 7.000000 | 1.000000 | 0.000000 | 1.990000 |
| max | 9.729284e+17 | 2.023090e+13 | 5.351276e+08 | 849.000000 | 1755.000000 | NaN | -37.484800 | 145.844997 | 16.000000 | NaN | 14.000000 | 117.000000 | 1125.000000 | 99999.000000 | 1000.000000 | 1125.000000 | 2.147484e+09 | 2.147484e+09 | 1000.000000 | 2.147484e+09 | NaN | 30.000000 | 60.000000 | 90.000000 | 365.000000 | 987.000000 | 922.000000 | 40.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | NaN | 327.000000 | 250.000000 | 251.000000 | 20.000000 | 44.060000 |
With .info() method we get a concise summary of the DataFrame. That method prints information about the DataFrame including the index, columns, non-null values, dtype and memory usage. It's a quick way to understand the structure of your data, including the number of entries, the data type of each column, and the amount of memory used.
# Check DataFrame information
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 23185 entries, 0 to 23184 Data columns (total 75 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 23185 non-null int64 1 listing_url 23185 non-null object 2 scrape_id 23185 non-null int64 3 last_scraped 23185 non-null object 4 source 23185 non-null object 5 name 23185 non-null object 6 description 22822 non-null object 7 neighborhood_overview 13256 non-null object 8 picture_url 23185 non-null object 9 host_id 23185 non-null int64 10 host_url 23185 non-null object 11 host_name 23183 non-null object 12 host_since 23183 non-null object 13 host_location 17286 non-null object 14 host_about 12833 non-null object 15 host_response_time 15418 non-null object 16 host_response_rate 15418 non-null object 17 host_acceptance_rate 16579 non-null object 18 host_is_superhost 22468 non-null object 19 host_thumbnail_url 23183 non-null object 20 host_picture_url 23183 non-null object 21 host_neighbourhood 8772 non-null object 22 host_listings_count 23183 non-null float64 23 host_total_listings_count 23183 non-null float64 24 host_verifications 23185 non-null object 25 host_has_profile_pic 23183 non-null object 26 host_identity_verified 23183 non-null object 27 neighbourhood 13257 non-null object 28 neighbourhood_cleansed 23185 non-null object 29 neighbourhood_group_cleansed 0 non-null float64 30 latitude 23185 non-null float64 31 longitude 23185 non-null float64 32 property_type 23185 non-null object 33 room_type 23185 non-null object 34 accommodates 23185 non-null int64 35 bathrooms 0 non-null float64 36 bathrooms_text 23174 non-null object 37 bedrooms 16888 non-null float64 38 beds 23002 non-null float64 39 amenities 23185 non-null object 40 price 23185 non-null object 41 minimum_nights 23185 non-null int64 42 maximum_nights 23185 non-null int64 43 minimum_minimum_nights 23185 non-null int64 44 maximum_minimum_nights 23185 non-null int64 45 minimum_maximum_nights 23185 non-null int64 46 maximum_maximum_nights 23185 non-null int64 47 minimum_nights_avg_ntm 23185 non-null float64 48 maximum_nights_avg_ntm 23185 non-null float64 49 calendar_updated 0 non-null float64 50 has_availability 23185 non-null object 51 availability_30 23185 non-null int64 52 availability_60 23185 non-null int64 53 availability_90 23185 non-null int64 54 availability_365 23185 non-null int64 55 calendar_last_scraped 23185 non-null object 56 number_of_reviews 23185 non-null int64 57 number_of_reviews_ltm 23185 non-null int64 58 number_of_reviews_l30d 23185 non-null int64 59 first_review 18675 non-null object 60 last_review 18675 non-null object 61 review_scores_rating 18675 non-null float64 62 review_scores_accuracy 18429 non-null float64 63 review_scores_cleanliness 18433 non-null float64 64 review_scores_checkin 18427 non-null float64 65 review_scores_communication 18432 non-null float64 66 review_scores_location 18427 non-null float64 67 review_scores_value 18427 non-null float64 68 license 0 non-null float64 69 instant_bookable 23185 non-null object 70 calculated_host_listings_count 23185 non-null int64 71 calculated_host_listings_count_entire_homes 23185 non-null int64 72 calculated_host_listings_count_private_rooms 23185 non-null int64 73 calculated_host_listings_count_shared_rooms 23185 non-null int64 74 reviews_per_month 18675 non-null float64 dtypes: float64(20), int64(21), object(34) memory usage: 13.3+ MB
The .columns attribute returns the names of the columns in the DataFrame. It's an efficient method to identify all the features or variables you have in your data.
# Check DataFrame column names
data.columns
Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'source', 'name',
'description', 'neighborhood_overview', 'picture_url', 'host_id',
'host_url', 'host_name', 'host_since', 'host_location', 'host_about',
'host_response_time', 'host_response_rate', 'host_acceptance_rate',
'host_is_superhost', 'host_thumbnail_url', 'host_picture_url',
'host_neighbourhood', 'host_listings_count',
'host_total_listings_count', 'host_verifications',
'host_has_profile_pic', 'host_identity_verified', 'neighbourhood',
'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude',
'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms',
'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price',
'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
'maximum_minimum_nights', 'minimum_maximum_nights',
'maximum_maximum_nights', 'minimum_nights_avg_ntm',
'maximum_nights_avg_ntm', 'calendar_updated', 'has_availability',
'availability_30', 'availability_60', 'availability_90',
'availability_365', 'calendar_last_scraped', 'number_of_reviews',
'number_of_reviews_ltm', 'number_of_reviews_l30d', 'first_review',
'last_review', 'review_scores_rating', 'review_scores_accuracy',
'review_scores_cleanliness', 'review_scores_checkin',
'review_scores_communication', 'review_scores_location',
'review_scores_value', 'license', 'instant_bookable',
'calculated_host_listings_count',
'calculated_host_listings_count_entire_homes',
'calculated_host_listings_count_private_rooms',
'calculated_host_listings_count_shared_rooms', 'reviews_per_month'],
dtype='object')
The .nunique() method identifies the number of unique values in each column of the DataFrame. This can be useful to understand the diversity of values in each variable in the DataFrame.
# Identify unique values by each column
data.nunique()
id 23185 listing_url 23185 scrape_id 1 last_scraped 2 source 2 name 16587 description 21358 neighborhood_overview 10985 picture_url 22651 host_id 13395 host_url 13395 host_name 5353 host_since 3942 host_location 957 host_about 6468 host_response_time 4 host_response_rate 62 host_acceptance_rate 94 host_is_superhost 2 host_thumbnail_url 13074 host_picture_url 13074 host_neighbourhood 117 host_listings_count 97 host_total_listings_count 135 host_verifications 8 host_has_profile_pic 2 host_identity_verified 2 neighbourhood 515 neighbourhood_cleansed 30 neighbourhood_group_cleansed 0 latitude 16261 longitude 16767 property_type 83 room_type 4 accommodates 16 bathrooms 0 bathrooms_text 38 bedrooms 12 beds 23 amenities 21892 price 924 minimum_nights 83 maximum_nights 192 minimum_minimum_nights 79 maximum_minimum_nights 88 minimum_maximum_nights 176 maximum_maximum_nights 174 minimum_nights_avg_ntm 300 maximum_nights_avg_ntm 601 calendar_updated 0 has_availability 2 availability_30 31 availability_60 61 availability_90 91 availability_365 366 calendar_last_scraped 2 number_of_reviews 442 number_of_reviews_ltm 137 number_of_reviews_l30d 20 first_review 3168 last_review 2172 review_scores_rating 163 review_scores_accuracy 157 review_scores_cleanliness 187 review_scores_checkin 168 review_scores_communication 152 review_scores_location 116 review_scores_value 162 license 0 instant_bookable 2 calculated_host_listings_count 63 calculated_host_listings_count_entire_homes 62 calculated_host_listings_count_private_rooms 19 calculated_host_listings_count_shared_rooms 9 reviews_per_month 822 dtype: int64
Using .isnull().sum() method gives you a summary of how many missing values there are in each column in the DataFrame. It's a crucial first step in understanding the quality and completeness of the data.
# Check for missing values
data.isnull().sum()
id 0 listing_url 0 scrape_id 0 last_scraped 0 source 0 name 0 description 363 neighborhood_overview 9929 picture_url 0 host_id 0 host_url 0 host_name 2 host_since 2 host_location 5899 host_about 10352 host_response_time 7767 host_response_rate 7767 host_acceptance_rate 6606 host_is_superhost 717 host_thumbnail_url 2 host_picture_url 2 host_neighbourhood 14413 host_listings_count 2 host_total_listings_count 2 host_verifications 0 host_has_profile_pic 2 host_identity_verified 2 neighbourhood 9928 neighbourhood_cleansed 0 neighbourhood_group_cleansed 23185 latitude 0 longitude 0 property_type 0 room_type 0 accommodates 0 bathrooms 23185 bathrooms_text 11 bedrooms 6297 beds 183 amenities 0 price 0 minimum_nights 0 maximum_nights 0 minimum_minimum_nights 0 maximum_minimum_nights 0 minimum_maximum_nights 0 maximum_maximum_nights 0 minimum_nights_avg_ntm 0 maximum_nights_avg_ntm 0 calendar_updated 23185 has_availability 0 availability_30 0 availability_60 0 availability_90 0 availability_365 0 calendar_last_scraped 0 number_of_reviews 0 number_of_reviews_ltm 0 number_of_reviews_l30d 0 first_review 4510 last_review 4510 review_scores_rating 4510 review_scores_accuracy 4756 review_scores_cleanliness 4752 review_scores_checkin 4758 review_scores_communication 4753 review_scores_location 4758 review_scores_value 4758 license 23185 instant_bookable 0 calculated_host_listings_count 0 calculated_host_listings_count_entire_homes 0 calculated_host_listings_count_private_rooms 0 calculated_host_listings_count_shared_rooms 0 reviews_per_month 4510 dtype: int64
Here we have a summary of the percentage of missing values.
# Check percentage of missing values
round(data.isnull().sum() / len(data) * 100,2)
id 0.00 listing_url 0.00 scrape_id 0.00 last_scraped 0.00 source 0.00 name 0.00 description 1.57 neighborhood_overview 42.83 picture_url 0.00 host_id 0.00 host_url 0.00 host_name 0.01 host_since 0.01 host_location 25.44 host_about 44.65 host_response_time 33.50 host_response_rate 33.50 host_acceptance_rate 28.49 host_is_superhost 3.09 host_thumbnail_url 0.01 host_picture_url 0.01 host_neighbourhood 62.17 host_listings_count 0.01 host_total_listings_count 0.01 host_verifications 0.00 host_has_profile_pic 0.01 host_identity_verified 0.01 neighbourhood 42.82 neighbourhood_cleansed 0.00 neighbourhood_group_cleansed 100.00 latitude 0.00 longitude 0.00 property_type 0.00 room_type 0.00 accommodates 0.00 bathrooms 100.00 bathrooms_text 0.05 bedrooms 27.16 beds 0.79 amenities 0.00 price 0.00 minimum_nights 0.00 maximum_nights 0.00 minimum_minimum_nights 0.00 maximum_minimum_nights 0.00 minimum_maximum_nights 0.00 maximum_maximum_nights 0.00 minimum_nights_avg_ntm 0.00 maximum_nights_avg_ntm 0.00 calendar_updated 100.00 has_availability 0.00 availability_30 0.00 availability_60 0.00 availability_90 0.00 availability_365 0.00 calendar_last_scraped 0.00 number_of_reviews 0.00 number_of_reviews_ltm 0.00 number_of_reviews_l30d 0.00 first_review 19.45 last_review 19.45 review_scores_rating 19.45 review_scores_accuracy 20.51 review_scores_cleanliness 20.50 review_scores_checkin 20.52 review_scores_communication 20.50 review_scores_location 20.52 review_scores_value 20.52 license 100.00 instant_bookable 0.00 calculated_host_listings_count 0.00 calculated_host_listings_count_entire_homes 0.00 calculated_host_listings_count_private_rooms 0.00 calculated_host_listings_count_shared_rooms 0.00 reviews_per_month 19.45 dtype: float64
Below I used .duplicated().sum() to identify duplicate rows in the DataFrame. Here, no duplicates were identified. This is a useful step to spot and handle redundant data that we might want to remove to ensure the quality and accuracy of our analysis. The next stage is Data Cleaning.
# Identify duplicates
data.duplicated().sum()
0
In the Data Cleaning process, each of the following steps is crucial in ensuring that the DataFrame is clean, consistent, and ready for analysis. They help improve the accuracy and reliability of the analysis.
Remove Unnecessary Columns: This step involves eliminating columns that are not required for the analysis. These could be columns that contain redundant or irrelevant information.
Handling Missing Values: This step deals with missing or null values in the DataFrame. Strategies to handle them can include filling them with a specific value (like mean, median, or mode) or removing the rows/columns with missing values entirely.
Data Type Conversion: Sometimes, the data types of the columns might not be appropriate for the analysis. For example, a numerical column might be stored as strings. This step involves converting such columns to their correct data type.
Handling Inconsistent Data: This process corrects any discrepancies and improper formatting to ensure data consistency.
Removing Duplicates: This step involves identifying and removing duplicate rows in the DataFrame. Duplicate data can skew the analysis and lead to incorrect conclusions.
Outlier Detection and Treatment: Outliers are data points that significantly deviate from other observations. They can be genuine variations or occur due to errors. This step involves identifying these outliers and determining the appropriate treatment, which could involve removal or transformation.
I'm going to review the columns names once more, determine which ones are relevant, and select them to find the answers.
# Check column names
data.columns
Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'source', 'name',
'description', 'neighborhood_overview', 'picture_url', 'host_id',
'host_url', 'host_name', 'host_since', 'host_location', 'host_about',
'host_response_time', 'host_response_rate', 'host_acceptance_rate',
'host_is_superhost', 'host_thumbnail_url', 'host_picture_url',
'host_neighbourhood', 'host_listings_count',
'host_total_listings_count', 'host_verifications',
'host_has_profile_pic', 'host_identity_verified', 'neighbourhood',
'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude',
'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms',
'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price',
'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
'maximum_minimum_nights', 'minimum_maximum_nights',
'maximum_maximum_nights', 'minimum_nights_avg_ntm',
'maximum_nights_avg_ntm', 'calendar_updated', 'has_availability',
'availability_30', 'availability_60', 'availability_90',
'availability_365', 'calendar_last_scraped', 'number_of_reviews',
'number_of_reviews_ltm', 'number_of_reviews_l30d', 'first_review',
'last_review', 'review_scores_rating', 'review_scores_accuracy',
'review_scores_cleanliness', 'review_scores_checkin',
'review_scores_communication', 'review_scores_location',
'review_scores_value', 'license', 'instant_bookable',
'calculated_host_listings_count',
'calculated_host_listings_count_entire_homes',
'calculated_host_listings_count_private_rooms',
'calculated_host_listings_count_shared_rooms', 'reviews_per_month'],
dtype='object')
Before proceeding with the Data Cleaning process, I will create a backcup of the original DataFrame to preserve our initial data. Then, I will select a subset of columns that are pertinent to our analysis. This approach is crucial in the Data Cleaning process as it allows us to focus on the most significant variables, thereby making the subsequent analysis more efficient and manageable.
# Drop irrelevant columns (making a DataFrame backup!)
data_v2 = data[[
# 'id',
'listing_url',
# 'scrape_id', 'last_scraped', 'source',
'name',
# 'description', 'neighborhood_overview', 'picture_url',
'host_id',
# 'host_url', 'host_name',
# 'host_since', 'host_location', 'host_about',
# 'host_response_time',
# 'host_response_rate', 'host_acceptance_rate',
# 'host_is_superhost',
# 'host_thumbnail_url', 'host_picture_url', 'host_neighbourhood',
# 'host_listings_count', 'host_total_listings_count',
# 'host_verifications',
# 'host_has_profile_pic', 'host_identity_verified', 'neighbourhood',
'neighbourhood_cleansed',
# 'neighbourhood_group_cleansed',
'latitude', 'longitude',
# 'property_type',
'room_type', 'accommodates',
# 'bathrooms',
'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price',
'minimum_nights',
# 'maximum_nights', 'minimum_minimum_nights',
# 'maximum_minimum_nights', 'minimum_maximum_nights',
# 'maximum_maximum_nights', 'minimum_nights_avg_ntm',
# 'maximum_nights_avg_ntm', 'calendar_updated', 'has_availability',
# 'availability_30', 'availability_60', 'availability_90',
# 'availability_365', 'calendar_last_scraped',
'number_of_reviews',
# 'number_of_reviews_ltm', 'number_of_reviews_l30d',
# 'first_review', 'last_review', 'review_scores_rating',
# 'review_scores_accuracy',
# 'review_scores_cleanliness', 'review_scores_checkin',
# 'review_scores_communication', 'review_scores_location',
# 'review_scores_value',
# 'license', 'instant_bookable', 'calculated_host_listings_count',
# 'calculated_host_listings_count_entire_homes',
# 'calculated_host_listings_count_private_rooms',
# 'calculated_host_listings_count_shared_rooms',
# 'reviews_per_month'
]]
By using the .shape attribute, we can observe that our DataFrame now consists of 23185 rows and 15 relevant columns mentioned below:
# Check new DataFrame dimension
data_v2.shape
(23185, 15)
I will use the .head() once again to get a quick overview of the new Dataframe.
# Preview the first 3 rows of the new DataFrame
data_v2.head(3)
| listing_url | name | host_id | neighbourhood_cleansed | latitude | longitude | room_type | accommodates | bathrooms_text | bedrooms | beds | amenities | price | minimum_nights | number_of_reviews | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.airbnb.com/rooms/10803 | Rental unit in Brunswick East · ★4.49 · 1 bedr... | 38901 | Moreland | -37.76606 | 144.97951 | Private room | 2 | 1 shared bath | NaN | 1.0 | ["Microwave", "Hot water kettle", "Drying rack... | $49.00 | 5 | 173 |
| 1 | https://www.airbnb.com/rooms/12936 | Rental unit in St Kilda · ★4.68 · 1 bedroom · ... | 50121 | Port Phillip | -37.85999 | 144.97662 | Entire home/apt | 2 | 1 bath | 1.0 | 1.0 | ["Microwave", "Lockbox", "Hot water", "Smoke a... | $95.00 | 3 | 42 |
| 2 | https://www.airbnb.com/rooms/38271 | Rental unit in Berwick · ★4.86 · 3 bedrooms · ... | 164193 | Casey | -38.05723 | 145.33982 | Entire home/apt | 5 | 1 bath | 3.0 | 3.0 | ["Microwave", "Hot water kettle", "Drying rack... | $116.00 | 1 | 228 |
By using .isnull().sum() method, we can see that there are 3 columns ('bathrooms_text', 'bedrooms', 'beds') with missing values. Let's proceed to handle these missing values.
# Check for missing values
data_v2.isnull().sum()
listing_url 0 name 0 host_id 0 neighbourhood_cleansed 0 latitude 0 longitude 0 room_type 0 accommodates 0 bathrooms_text 11 bedrooms 6297 beds 183 amenities 0 price 0 minimum_nights 0 number_of_reviews 0 dtype: int64
This line of code is a common way to identify and inspect rows with missing values in a specific column of a DataFrame. Here, I'm examining the 'bathrooms_text' column which has been found to contain 11 missing values.
# Check "bathrooms_text" column missing values
data_v2[data_v2['bathrooms_text'].isnull()]
| listing_url | name | host_id | neighbourhood_cleansed | latitude | longitude | room_type | accommodates | bathrooms_text | bedrooms | beds | amenities | price | minimum_nights | number_of_reviews | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 120 | https://www.airbnb.com/rooms/268620 | Home in Caroline Springs · 1 bedroom | 1407948 | Melton | -37.73975 | 144.73742 | Private room | 1 | NaN | NaN | NaN | [] | $72.00 | 1 | 0 |
| 211 | https://www.airbnb.com/rooms/508590 | Home in Balwyn North · ★4.63 · 1 bedroom | 2255480 | Boroondara | -37.79370 | 145.07935 | Private room | 1 | NaN | NaN | NaN | [] | $50.00 | 1 | 16 |
| 6523 | https://www.airbnb.com/rooms/26888053 | Nature lodge in Mt Dandenong · ★4.78 · 1 bed | 156749664 | Yarra Ranges | -37.82976 | 145.35228 | Hotel room | 2 | NaN | NaN | 1.0 | ["Clothing storage: closet", "Microwave", "Hot... | $325.00 | 1 | 41 |
| 6537 | https://www.airbnb.com/rooms/26904346 | Nature lodge in Mt Dandenong · ★5.0 · 1 bedroo... | 156749664 | Yarra Ranges | -37.83003 | 145.35347 | Hotel room | 2 | NaN | 1.0 | 1.0 | ["Microwave", "Hot water kettle", "Carbon mono... | $364.00 | 1 | 3 |
| 7643 | https://www.airbnb.com/rooms/31025816 | Aparthotel in Docklands · ★4.21 · 2 bedrooms ·... | 16970146 | Melbourne | -37.81403 | 144.94222 | Hotel room | 4 | NaN | 2.0 | 2.0 | ["Microwave", "Coffee maker", "High chair", "H... | $282.00 | 1 | 19 |
| 8388 | https://www.airbnb.com/rooms/34046314 | Serviced apartment in Southbank · ★4.84 · 2 be... | 255905173 | Melbourne | -37.82815 | 144.96857 | Hotel room | 4 | NaN | 2.0 | 2.0 | ["Microwave", "Carbon monoxide alarm", "Hot wa... | $353.00 | 1 | 44 |
| 8641 | https://www.airbnb.com/rooms/35914383 | Serviced apartment in Williamstown · ★4.67 · 2... | 270134235 | Hobsons Bay | -37.85641 | 144.87878 | Hotel room | 4 | NaN | 2.0 | 3.0 | ["Smoke alarm", "Wifi", "Kitchen", "Dryer", "I... | $394.00 | 6 | 9 |
| 8696 | https://www.airbnb.com/rooms/35717112 | Home in Melton West · Studio · 1 bed | 268666048 | Melton | -37.66411 | 144.56059 | Private room | 2 | NaN | NaN | 1.0 | ["Smoke alarm", "Wifi", "Kitchen", "Private li... | $75.00 | 1 | 0 |
| 9051 | https://www.airbnb.com/rooms/38076289 | Aparthotel in Docklands · ★4.26 · 3 bedrooms ·... | 16970146 | Melbourne | -37.81403 | 144.94222 | Hotel room | 6 | NaN | 3.0 | 4.0 | ["Microwave", "Coffee maker", "High chair", "H... | $429.00 | 2 | 42 |
| 9238 | https://www.airbnb.com/rooms/38883439 | Serviced apartment in Point Cook · ★4.17 · 2 b... | 296379242 | Wyndham | -37.89366 | 144.76554 | Hotel room | 4 | NaN | 2.0 | 2.0 | ["Security cameras on property", "Smoke alarm"... | $269.00 | 1 | 6 |
| 10898 | https://www.airbnb.com/rooms/43540050 | Villa in Murrumbeena · ★4.91 · Studio · 1 bed | 166664813 | Glen Eira | -37.90158 | 145.06935 | Private room | 2 | NaN | NaN | 1.0 | ["Security cameras on property", "Wifi", "Kitc... | $40.00 | 7 | 23 |
All the 11 missing values in the "bathrooms_text" column correspond to properties with no specified number of bathrooms. Therefore, I've decided to fill these NaN values with '0 baths'.
# Fill all the 11 "bathrooms_text" missing values
data_v2.loc[data_v2['bathrooms_text'].isnull(), 'bathrooms_text'] = '0 baths'
Now, I'm going to check if that was treated.
# Check "bathrooms_text" missing values
data_v2['bathrooms_text'].isnull().sum()
0
After treating the 'bathrooms_text' missing values, I will examine the 'bedrooms' column, which has been found to contain 6297 missing values. Here, I'm just previewing the first 10 rows.
# Check "bedrooms" column empty values
data_v2.loc[data_v2['bedrooms'].isnull()].head(10)
| listing_url | name | host_id | neighbourhood_cleansed | latitude | longitude | room_type | accommodates | bathrooms_text | bedrooms | beds | amenities | price | minimum_nights | number_of_reviews | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.airbnb.com/rooms/10803 | Rental unit in Brunswick East · ★4.49 · 1 bedr... | 38901 | Moreland | -37.766060 | 144.979510 | Private room | 2 | 1 shared bath | NaN | 1.0 | ["Microwave", "Hot water kettle", "Drying rack... | $49.00 | 5 | 173 |
| 3 | https://www.airbnb.com/rooms/41836 | Home in Reservoir · ★4.71 · 1 bedroom · 1 bed ... | 182833 | Darebin | -37.697610 | 145.000660 | Private room | 2 | 1 shared bath | NaN | 1.0 | ["Smoking allowed", "Smoke alarm", "Wifi", "Ki... | $40.00 | 7 | 159 |
| 7 | https://www.airbnb.com/rooms/603007 | Rental unit in South Yarra · ★4.76 · 1 bedroom... | 2705870 | Melbourne | -37.830220 | 144.986820 | Private room | 2 | 1 shared bath | NaN | 1.0 | ["Private living room", "Carbon monoxide alarm... | $59.00 | 1 | 21 |
| 15 | https://www.airbnb.com/rooms/628156 | Condo in Port Melbourne · ★4.82 · 1 bedroom · ... | 3119669 | Port Phillip | -37.841580 | 144.937940 | Private room | 2 | 1 private bath | NaN | 1.0 | ["Stainless steel gas stove", "Microwave", "AC... | $90.00 | 5 | 143 |
| 17 | https://www.airbnb.com/rooms/44699 | Condo in South Melbourne · ★4.75 · 1 bedroom ·... | 189245 | Port Phillip | -37.831557 | 144.970644 | Private room | 2 | 1 shared bath | NaN | 1.0 | ["Stainless steel gas stove", "Microwave", "St... | $88.00 | 1 | 54 |
| 20 | https://www.airbnb.com/rooms/1234561 | Home in St Kilda · ★4.77 · 1 bedroom · 1 bed ·... | 6731712 | Port Phillip | -37.858440 | 144.988940 | Private room | 1 | 1 bath | NaN | 1.0 | ["Smoking allowed", "Wifi", "Kitchen", "Dedica... | $39.00 | 2 | 23 |
| 27 | https://www.airbnb.com/rooms/67211 | Rental unit in Kew · ★4.82 · 1 bedroom · 1 bed... | 326880 | Boroondara | -37.803470 | 145.035830 | Private room | 1 | 1 shared bath | NaN | 1.0 | ["Microwave", "Free street parking", "Lockbox"... | $50.00 | 2 | 225 |
| 30 | https://www.airbnb.com/rooms/70328 | Townhouse in Mordialloc · ★4.94 · 1 bedroom · ... | 356006 | Kingston | -38.007300 | 145.085440 | Private room | 2 | 1 private bath | NaN | 1.0 | ["Clothing storage: closet", "Microwave", "Hot... | $65.00 | 1 | 147 |
| 32 | https://www.airbnb.com/rooms/78143 | Townhouse in Prahran · ★4.79 · 2 bedrooms · 2 ... | 419767 | Stonnington | -37.851620 | 144.989430 | Private room | 2 | 1 private bath | NaN | 2.0 | ["Microwave", "Hot water kettle", "Shared outd... | $96.00 | 2 | 223 |
| 33 | https://www.airbnb.com/rooms/671981 | Home in Yarraville · 1 bedroom · 1 bed · 1 sha... | 3395313 | Maribyrnong | -37.815070 | 144.888950 | Shared room | 1 | 1 shared bath | NaN | 1.0 | ["TV"] | $100.00 | 1 | 0 |
Here, I will create a new DataFrame 'bedrooms_info' that contains the split and cleaned information from the 'name' column where we have 'bedrooms' information. This information can then be used to fill the missing values in the 'bedrooms' column.
# Extract values from the column "name" to fill missing values in the "bedrooms" column
bedrooms_info = data_v2[data_v2['bedrooms'].isnull()]['name'].str.split('·', expand=True).applymap(lambda x: x.strip() if x is not None else x).copy()
We've managed to extract information about bedrooms from the 'name' column. We can observe that the second and third columns contain bedroom information. This can be used to handle missing values in the 'bedrooms' column.
# Preview the first 10 rows
bedrooms_info.head(10)
| 0 | 1 | 2 | 3 | 4 | |
|---|---|---|---|---|---|
| 0 | Rental unit in Brunswick East | ★4.49 | 1 bedroom | 1 bed | 1 shared bath |
| 3 | Home in Reservoir | ★4.71 | 1 bedroom | 1 bed | 1 shared bath |
| 7 | Rental unit in South Yarra | ★4.76 | 1 bedroom | 1 bed | 1 shared bath |
| 15 | Condo in Port Melbourne | ★4.82 | 1 bedroom | 1 bed | 1 private bath |
| 17 | Condo in South Melbourne | ★4.75 | 1 bedroom | 1 bed | 1 shared bath |
| 20 | Home in St Kilda | ★4.77 | 1 bedroom | 1 bed | 1 bath |
| 27 | Rental unit in Kew | ★4.82 | 1 bedroom | 1 bed | 1 shared bath |
| 30 | Townhouse in Mordialloc | ★4.94 | 1 bedroom | 1 bed | 1 private bath |
| 32 | Townhouse in Prahran | ★4.79 | 2 bedrooms | 2 beds | 1 private bath |
| 33 | Home in Yarraville | 1 bedroom | 1 bed | 1 shared bath | None |
As mentioned above, we have bedroom information in the second and third columns. So, I will use the .value_counts() method to count the distribution of each value, in this case, in the second column.
# Check the values from the second column to extract the bedroom information
bedrooms_info[1].value_counts()
1 bedroom 2121 ★5.0 727 ★New 336 2 bedrooms 226 Studio 200 ★4.67 121 ★4.75 101 ★4.80 93 ★4.83 87 ★4.50 84 ★4.86 80 ★4.91 78 ★4.94 76 ★4.90 74 ★4.95 72 ★4.0 71 ★4.89 70 ★4.88 66 ★4.93 65 ★4.33 65 ★4.92 62 ★4.96 60 ★4.78 59 ★4.85 59 3 bedrooms 56 ★4.97 50 ★4.82 42 ★4.71 40 ★4.87 38 ★4.60 38 ★4.25 38 ★4.79 36 ★4.69 33 ★4.77 32 ★4.98 30 ★4.81 30 ★4.56 28 ★4.84 26 ★4.38 26 ★4.64 25 ★4.76 24 ★4.72 23 ★4.63 23 ★4.73 22 4 bedrooms 21 ★4.74 19 ★4.70 19 ★4.68 18 ★4.58 17 ★4.40 17 ★4.65 17 ★4.62 16 ★4.57 16 ★4.44 16 ★4.55 15 ★4.53 14 ★4.54 14 ★4.43 13 ★3.67 13 ★4.61 13 ★4.47 13 ★3.33 11 ★3.50 11 ★4.59 11 ★4.20 11 ★4.42 11 ★4.36 11 ★4.45 10 ★4.46 10 ★3.75 9 ★4.27 8 ★3.83 8 ★4.17 7 ★4.37 7 ★4.29 7 ★3.0 7 ★4.66 7 ★4.41 7 ★4.52 7 ★4.14 7 ★3.60 6 ★4.13 6 ★4.31 6 ★4.22 5 ★3.80 5 ★4.99 5 ★4.09 5 ★4.11 5 ★4.30 5 ★4.35 5 ★4.39 4 ★0.0 4 ★4.26 4 ★4.16 4 1 bed 4 ★4.10 4 ★3.86 3 6 bedrooms 3 ★4.08 3 ★4.49 3 ★4.51 3 ★4.24 3 ★4.15 3 ★4.28 3 5 bedrooms 3 ★4.04 2 ★3.89 2 ★4.23 2 ★4.21 2 ★3.57 2 ★4.34 2 ★2.33 2 ★4.19 2 ★4.32 2 ★4.06 2 ★2.67 2 ★2.0 1 ★3.68 1 ★3.20 1 ★4.48 1 ★3.79 1 ★3.91 1 ★2.80 1 3 beds 1 ★3.85 1 ★2.20 1 ★3.92 1 7 bedrooms 1 ★3.94 1 ★2.75 1 ★3.25 1 ★4.07 1 ★3.40 1 ★3.93 1 2 beds 1 ★3.90 1 Name: 1, dtype: int64
The results from the last cell show a variety of values. However, we're specifically interested in the number of bedrooms. To extract this information, I'll create a list with specific values.
# Create a list to extract the bedroom information from the second column in bedrooms_info
specific_values_1 = ['Studio', '1 bedroom', '2 bedrooms', '3 bedrooms',
'4 bedrooms', '5 bedrooms', '6 bedrooms',
'7 bedrooms']
Here, I'll replicate the previous process to count the distribution of each value, this time focusing on the third column.
# Check the values from the third column to extract the bedroom information
bedrooms_info[2].value_counts()
1 bedroom 2801 1 bed 2158 Studio 529 2 beds 309 2 bedrooms 215 3 beds 63 3 bedrooms 57 4 beds 33 1 shared bath 29 4 bedrooms 25 1 bath 15 1 private bath 15 5 beds 12 6 bedrooms 7 5 bedrooms 6 2 baths 3 1.5 shared baths 3 14 bedrooms 3 0 shared baths 2 1.5 baths 2 2 shared baths 2 2.5 shared baths 2 8 beds 1 7 bedrooms 1 0 baths 1 6 beds 1 7 beds 1 Name: 2, dtype: int64
In this cell, I'm going to create a new list with specific values to extract bedrooms info from the third column.
# Create a list to extract the bedroom information from the third column in bedrooms_info
specific_values_2 = ['Studio', '1 bedroom', '2 bedrooms', '3 bedrooms',
'4 bedrooms', '5 bedrooms', '6 bedrooms',
'7 bedrooms', '14 bedrooms']
After creating the list 'specific_values_1', I'll extract rows from the DataFrame bedrooms_info where the second column's value is in 'specific_values_1'. I'll use the .isin() function for this and save the result in bedrooms_info_1.
# Extract values from the second column
bedrooms_info_1 = bedrooms_info[bedrooms_info[1].isin(specific_values_1)]
And here, I'm going to do the same for the third column, storing the results in bedrooms_info_2. This is a common operation when you want to filter data based on certain criteria.
# Extract values from the third column
bedrooms_info_2 = bedrooms_info[bedrooms_info[2].isin(specific_values_2)]
With the bedroom information extracted, I will create a new DataFrame. This DataFrame will concatenate the data from the second and third columns.
# Concat bedroom information from the second and third columns
bedrooms_info_complete = pd.concat([bedrooms_info_1[1], bedrooms_info_2[2]],)
Now I will sort the DataFrame index using the .sort_index() method, passing as argument inplace equals True (which means that the sorting is done on the original DataFrame).
# Sort the index
bedrooms_info_complete.sort_index(inplace=True)
By applying the .value_counts() to the bedrooms_info_complete DataFrame, we can count the frequency of each unique value that was extracted.
# Check unique value
bedrooms_info_complete.value_counts()
1 bedroom 4922 Studio 729 2 bedrooms 441 3 bedrooms 113 4 bedrooms 46 6 bedrooms 10 5 bedrooms 9 14 bedrooms 3 7 bedrooms 2 dtype: int64
The values extracted into the 'bedrooms_info_complete' are strings, but the values in the 'bedrooms' column are numeric. Therefore, I'm going to convert the values in 'bedrooms_info_complete' to numeric.
# Convert string values to numeric values
replace_values = {'Studio': 0.0, '1 bedroom': 1.0, '2 bedrooms': 2.0, '3 bedrooms': 3.0, '4 bedrooms': 4.0, '5 bedrooms': 5.0, '6 bedrooms': 6.0, '7 bedrooms': 7.0, '14 bedrooms': 14.0}
bedrooms_info_complete = bedrooms_info_complete.replace(replace_values)
Using the .value_counts() to check the converted values.
# Check unique values
bedrooms_info_complete.value_counts()
1.0 4922 0.0 729 2.0 441 3.0 113 4.0 46 6.0 10 5.0 9 14.0 3 7.0 2 dtype: int64
Before filling in missing values in 'bedrooms' column, I will create a backup of the DataFrame.
# Make another DataFrame backup
data_v3 = data_v2.copy()
After creating a backup, I will use the .fillna() method to fill in the missing values in the ‘bedrooms’ column with the extracted bedroom information.
# Fill the "bedrooms" missing values with bedroom information
data_v3['bedrooms'] = data_v3['bedrooms'].fillna(bedrooms_info_complete)
Upon inspecting the 'bedrooms' column in our DataFrame, we find that there are still 22 missing values. This requires a more detailed analysis.
# Check "bedrooms" missing values
data_v3['bedrooms'].isnull().sum()
22
As we have just 22 missing values, I’m going to inspect each of them to try to find some patterns.
# Check the "bedrooms" 22 missing values
data_v3.loc[data_v3['bedrooms'].isnull()].head(22)
| listing_url | name | host_id | neighbourhood_cleansed | latitude | longitude | room_type | accommodates | bathrooms_text | bedrooms | beds | amenities | price | minimum_nights | number_of_reviews | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1161 | https://www.airbnb.com/rooms/6825713 | Place to stay in Collingwood · ★4.82 · 2 beds ... | 24685614 | Yarra | -37.79515 | 144.985340 | Entire home/apt | 3 | 1 bath | NaN | 2.0 | ["Microwave", "Drying rack for clothing", "Car... | $139.00 | 2 | 298 |
| 2327 | https://www.airbnb.com/rooms/12293289 | Castle in The Patch · ★4.97 · 1 bed · 1 bath | 66261359 | Yarra Ranges | -37.88414 | 145.391480 | Entire home/apt | 2 | 1 bath | NaN | 1.0 | ["Microwave", "Hot water kettle", "BBQ grill",... | $489.00 | 1 | 149 |
| 2628 | https://www.airbnb.com/rooms/13303352 | Farm stay in Chum Creek/Healesville · ★4.95 · ... | 21470508 | Yarra Ranges | -37.61723 | 145.496610 | Entire home/apt | 2 | 1 bath | NaN | 1.0 | ["Microwave", "Induction stove", "Hot water ke... | $141.00 | 1 | 393 |
| 4254 | https://www.airbnb.com/rooms/19989578 | Place to stay in Fitzroy · 2 beds · 1 shared bath | 26178901 | Yarra | -37.79769 | 144.980440 | Private room | 4 | 1 shared bath | NaN | 2.0 | ["Wifi", "Kitchen", "Washer", "Hangers", "Esse... | $300.00 | 1 | 0 |
| 5314 | https://www.airbnb.com/rooms/22488519 | Place to stay in South Melbourne · 1 bed · 1 s... | 526851 | Port Phillip | -37.83458 | 144.968890 | Private room | 1 | 1 shared bath | NaN | 1.0 | ["Smoke alarm", "Wifi", "Kitchen", "Dryer", "I... | $50.00 | 1 | 1 |
| 5503 | https://www.airbnb.com/rooms/22814463 | Place to stay in Badger Creek · ★4.85 · 1 bed ... | 107919473 | Yarra Ranges | -37.68255 | 145.557360 | Entire home/apt | 2 | 1 bath | NaN | 1.0 | ["Kitchen", "Stove", "Patio or balcony", "Dish... | $199.00 | 2 | 185 |
| 5612 | https://www.airbnb.com/rooms/23069274 | Tiny home in Clematis · ★4.90 · 1 bed · 1 bath | 62077033 | Cardinia | -37.92948 | 145.410450 | Entire home/apt | 2 | 1 bath | NaN | 1.0 | ["Microwave", "Hot water kettle", "Drying rack... | $130.00 | 1 | 263 |
| 6114 | https://www.airbnb.com/rooms/24945953 | Castle in The Patch · ★4.92 · 1 bed · 1 bath | 66261359 | Yarra Ranges | -37.88362 | 145.390070 | Entire home/apt | 2 | 1 bath | NaN | 1.0 | ["Microwave", "Induction stove", "Hot water ke... | $494.00 | 1 | 38 |
| 6523 | https://www.airbnb.com/rooms/26888053 | Nature lodge in Mt Dandenong · ★4.78 · 1 bed | 156749664 | Yarra Ranges | -37.82976 | 145.352280 | Hotel room | 2 | 0 baths | NaN | 1.0 | ["Clothing storage: closet", "Microwave", "Hot... | $325.00 | 1 | 41 |
| 8331 | https://www.airbnb.com/rooms/34515725 | Tiny home in Beaumaris · 1 bed · 1 shared bath | 29665549 | Bayside | -37.99194 | 145.041330 | Private room | 2 | 1 shared bath | NaN | 1.0 | ["Smoke alarm", "Kitchen", "Hot tub", "Carbon ... | $150.00 | 1 | 0 |
| 8534 | https://www.airbnb.com/rooms/34977090 | Farm stay in Yarra Junction · ★4.83 · 1 bed · ... | 25918877 | Yarra Ranges | -37.78794 | 145.639140 | Entire home/apt | 2 | 1 bath | NaN | 1.0 | ["Microwave", "Hot water kettle", "BBQ grill",... | $206.00 | 1 | 166 |
| 8712 | https://www.airbnb.com/rooms/36046114 | Tiny home in Sandringham · ★4.93 · 3 beds · 1 ... | 229196502 | Bayside | -37.95524 | 145.021260 | Entire home/apt | 6 | 1 bath | NaN | 3.0 | ["Integrated Fisher and Paykel refrigerator", ... | $220.00 | 3 | 111 |
| 8980 | https://www.airbnb.com/rooms/37500292 | Tiny home in East Warburton · ★4.81 · 3 beds ·... | 262886891 | Yarra Ranges | -37.73093 | 145.752710 | Entire home/apt | 3 | 1 bath | NaN | 3.0 | ["BBQ grill", "Breakfast", "Barbecue utensils"... | $165.00 | 2 | 34 |
| 9104 | https://www.airbnb.com/rooms/38117602 | Farm stay in Healesville · ★4.40 · 1 bed · 1.5... | 100564438 | Yarra Ranges | -37.67870 | 145.515640 | Private room | 1 | 1.5 baths | NaN | 1.0 | ["Smoking allowed", "Smoke alarm", "Wifi", "Pr... | $110.00 | 300 | 6 |
| 9345 | https://www.airbnb.com/rooms/38979142 | Camper/RV in St Kilda · 1 bed · 0 baths | 56324731 | Port Phillip | -37.86176 | 144.972330 | Entire home/apt | 3 | 0 baths | NaN | 1.0 | ["Smoking allowed", "Kitchen", "First aid kit"... | $35.00 | 120 | 0 |
| 9806 | https://www.airbnb.com/rooms/40377799 | Tent in Healesville · ★4.81 · 1 bed · 1 bath | 18768614 | Yarra Ranges | -37.64708 | 145.534910 | Entire home/apt | 2 | 1 bath | NaN | 1.0 | ["Smoke alarm", "Patio or balcony", "Dishes an... | $200.00 | 1 | 194 |
| 9960 | https://www.airbnb.com/rooms/40510306 | Barn in Dixons Creek · ★4.95 · 1 bed · 1 bath | 272422171 | Yarra Ranges | -37.61312 | 145.403040 | Entire home/apt | 2 | 1 bath | NaN | 1.0 | ["Microwave", "Breakfast", "Coffee maker", "Fr... | $150.00 | 2 | 65 |
| 10179 | https://www.airbnb.com/rooms/41382311 | Hut in Millgrove · 3 beds · 1 bath | 15783446 | Yarra Ranges | -37.75091 | 145.648950 | Entire home/apt | 5 | 1 bath | NaN | 3.0 | ["Kitchen", "Essentials", "Free parking on pre... | $80.00 | 50 | 0 |
| 11188 | https://www.airbnb.com/rooms/46096289 | Tent in The Patch · ★4.82 · 1 bed · 1 bath | 31068242 | Yarra Ranges | -37.88634 | 145.399630 | Entire home/apt | 2 | 1 bath | NaN | 1.0 | ["Bluetooth sound system", "Outdoor shower", "... | $185.00 | 1 | 130 |
| 11768 | https://www.airbnb.com/rooms/48741282 | Camper/RV in Lilydale · 1 bed · 1 bath | 7104573 | Yarra Ranges | -37.75890 | 145.391040 | Entire home/apt | 2 | 1 bath | NaN | 1.0 | ["Security cameras on property", "Smoke alarm"... | $170.00 | 2 | 0 |
| 13040 | https://www.airbnb.com/rooms/53895756 | Place to stay in Alphington · ★4.92 · 2 beds ·... | 238794429 | Darebin | -37.77995 | 145.031630 | Entire home/apt | 3 | 1 bath | NaN | 2.0 | ["AC - split type ductless system", "Lockbox",... | $137.00 | 2 | 87 |
| 16833 | https://www.airbnb.com/rooms/767566122456779194 | Tiny home in Healesville · ★4.97 · 1 bed · 1 bath | 50611336 | Yarra Ranges | -37.65515 | 145.513968 | Entire home/apt | 2 | 1 bath | NaN | 1.0 | ["Wifi", "Kitchen", "BBQ grill", "Refrigerator... | $179.00 | 1 | 30 |
After inspecting all 22 missing values in the 'bedrooms' column, it was observed that they all correspond to 'Studio'. Therefore, these mising values will be replaced with 0.0, which represents 'Studio'.
# Fill in the missing values in the 'bedrooms' column with 0.0, which represents 'Studio'
data_v3[data_v3['bedrooms'].isnull()] = data_v3[data_v3['bedrooms'].isnull()].fillna(0.0)
Missing values in the 'bedrooms' column have been successfully handled. The next step is to investigate and address the 183 missing values in the 'beds' column.
# Check for missing values
data_v3.isnull().sum()
listing_url 0 name 0 host_id 0 neighbourhood_cleansed 0 latitude 0 longitude 0 room_type 0 accommodates 0 bathrooms_text 0 bedrooms 0 beds 183 amenities 0 price 0 minimum_nights 0 number_of_reviews 0 dtype: int64
Let's display the first 5 rows in the 'beds' column with NaN values.
# Check "beds" column missing values
data_v3[data_v3['beds'].isnull()].head()
| listing_url | name | host_id | neighbourhood_cleansed | latitude | longitude | room_type | accommodates | bathrooms_text | bedrooms | beds | amenities | price | minimum_nights | number_of_reviews | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 120 | https://www.airbnb.com/rooms/268620 | Home in Caroline Springs · 1 bedroom | 1407948 | Melton | -37.73975 | 144.73742 | Private room | 1 | 0 baths | 1.0 | NaN | [] | $72.00 | 1 | 0 |
| 211 | https://www.airbnb.com/rooms/508590 | Home in Balwyn North · ★4.63 · 1 bedroom | 2255480 | Boroondara | -37.79370 | 145.07935 | Private room | 1 | 0 baths | 1.0 | NaN | [] | $50.00 | 1 | 16 |
| 212 | https://www.airbnb.com/rooms/1742469 | Home in Caulfield South · ★4.92 · 1 bedroom · ... | 9173051 | Glen Eira | -37.89613 | 145.01357 | Private room | 1 | 1 private bath | 1.0 | NaN | ["Microwave", "Fast wifi \u2013 56 Mbps", "Hot... | $83.00 | 2 | 62 |
| 245 | https://www.airbnb.com/rooms/1815508 | Rental unit in Richmond · ★4.83 · 1 bedroom · ... | 9509273 | Yarra | -37.82210 | 144.99922 | Entire home/apt | 2 | 1 bath | 1.0 | NaN | ["Microwave", "Free street parking", "Lockbox"... | $141.00 | 22 | 413 |
| 642 | https://www.airbnb.com/rooms/4372398 | Rental unit in St Kilda · ★4.66 · 1 bedroom · ... | 22700287 | Port Phillip | -37.86735 | 144.97775 | Entire home/apt | 2 | 1 bath | 1.0 | NaN | ["Microwave", "Hot water kettle", "Books and r... | $125.00 | 3 | 74 |
Upon analyzing the data, a common pattern emerged: listings with either 1 bedroom or a studio, accommodating no more than 2 guests, typically have 1 bed. As such, the missing 'beds' values in these listings will be filled with the mode, which is 1.
# Impute missing 'beds' values for listings accommodating up to 2 guests in 1 bedroom or less with the mode of 'beds', which is 1
data_v3.loc[(data_v3['beds'].isnull()) & (data_v3['accommodates'] <= 2) & (data_v3['bedrooms'] <= 1), 'beds'] = st.mode(data_v3['beds'])
We still have 31 missings values in the 'beds' column.
# Check for missing values
data_v3.isnull().sum()
listing_url 0 name 0 host_id 0 neighbourhood_cleansed 0 latitude 0 longitude 0 room_type 0 accommodates 0 bathrooms_text 0 bedrooms 0 beds 31 amenities 0 price 0 minimum_nights 0 number_of_reviews 0 dtype: int64
After dealing with missing values based on a specific condition, the 'listing_url' column will be used to tackle the remaining missing data.
# Analyze the missing 'beds' values and checking the 'listing_url' column as a reference to handle the missing values
data_v3[data_v3['beds'].isnull()].head(10)
| listing_url | name | host_id | neighbourhood_cleansed | latitude | longitude | room_type | accommodates | bathrooms_text | bedrooms | beds | amenities | price | minimum_nights | number_of_reviews | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 783 | https://www.airbnb.com/rooms/5573248 | Home in Richmond · ★4.89 · 2 bedrooms · 1.5 baths | 6518033 | Yarra | -37.81748 | 144.99520 | Entire home/apt | 4 | 1.5 baths | 2.0 | NaN | ["Stainless steel gas stove", "Microwave", "St... | $206.00 | 1 | 264 |
| 845 | https://www.airbnb.com/rooms/5096866 | Home in Coburg North · ★4.71 · 1 bedroom · 1 s... | 26336840 | Moreland | -37.72044 | 144.96969 | Private room | 3 | 1 shared bath | 1.0 | NaN | ["Smoking allowed", "Smoke alarm", "Wifi", "Ki... | $42.00 | 5 | 43 |
| 2430 | https://www.airbnb.com/rooms/12633686 | Townhouse in Mulgrave · 4 bedrooms · 2.5 baths | 40094786 | Monash | -37.92501 | 145.18905 | Entire home/apt | 6 | 2.5 baths | 4.0 | NaN | ["Microwave", "Hot water kettle", "BBQ grill",... | $380.00 | 15 | 0 |
| 2780 | https://www.airbnb.com/rooms/13592574 | Rental unit in Prahran · ★4.94 · 1 bedroom · 1... | 78547489 | Stonnington | -37.85131 | 144.98881 | Entire home/apt | 4 | 1 bath | 1.0 | NaN | ["Stainless steel gas stove", "Microwave", "Ho... | $160.00 | 4 | 35 |
| 2895 | https://www.airbnb.com/rooms/14094124 | Townhouse in Brighton · ★5.0 · 2 bedrooms · 2 ... | 84806429 | Bayside | -37.88982 | 145.00013 | Entire home/apt | 4 | 2 baths | 2.0 | NaN | ["Wifi", "Kitchen", "Dryer", "Iron", "First ai... | $175.00 | 3 | 3 |
| 3419 | https://www.airbnb.com/rooms/16440438 | Cottage in Hoddles Creek · ★5.0 · 2 bedrooms ·... | 88051903 | Yarra Ranges | -37.81570 | 145.56603 | Entire home/apt | 4 | 1 bath | 2.0 | NaN | ["Microwave", "Hot water kettle", "AC - split ... | $330.00 | 2 | 53 |
| 3478 | https://www.airbnb.com/rooms/16233947 | Home in Clifton Hill · ★5.0 · 3 bedrooms · 2 b... | 33071051 | Yarra | -37.78820 | 145.00014 | Entire home/apt | 4 | 2 baths | 3.0 | NaN | ["Pool", "Smoke alarm", "Wifi", "Kitchen", "Dr... | $350.00 | 7 | 12 |
| 5786 | https://www.airbnb.com/rooms/23931198 | Villa in Carrum · ★4.83 · 3 bedrooms · 1 share... | 131401047 | Kingston | -38.08143 | 145.14034 | Private room | 1 | 1 shared bath | 3.0 | NaN | ["Smoke alarm", "Wifi", "Kitchen", "Bed linens... | $38.00 | 3 | 6 |
| 8229 | https://www.airbnb.com/rooms/33601421 | Home in Kew · ★5.0 · 2 bedrooms · 1.5 shared b... | 50264189 | Boroondara | -37.79356 | 145.03792 | Private room | 2 | 1.5 shared baths | 2.0 | NaN | ["Microwave", "BBQ grill", "Private living roo... | $69.00 | 1 | 3 |
| 9302 | https://www.airbnb.com/rooms/38472816 | Condo in Southbank · ★4.53 · 1 bedroom · 1 bath | 155404054 | Melbourne | -37.82225 | 144.96119 | Entire home/apt | 4 | 1 bath | 1.0 | NaN | ["Microwave", "Bluetooth sound system", "Harbo... | $122.00 | 1 | 74 |
Out of 31 missing values, each was checked against a specifc 'listing_url'. I identified 26 corresponding values that will be used to fill the NaNs.
# Fill the 'beds' missing values
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/5573248', 'beds'] = 4.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/5096866', 'beds'] = 1.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/13592574', 'beds'] = 1.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/14094124', 'beds'] = 2.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/16440438', 'beds'] = 2.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/16233947', 'beds'] = 3.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/23931198', 'beds'] = 3.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/33601421', 'beds'] = 2.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/38472816', 'beds'] = 1.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/39319156', 'beds'] = 9.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/39454193', 'beds'] = 1.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/40165294', 'beds'] = 1.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/40603921', 'beds'] = 1.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/42317481', 'beds'] = 1.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/42318082', 'beds'] = 2.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/43293837', 'beds'] = 1.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/43117636', 'beds'] = 2.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/46879468', 'beds'] = 2.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/49820299', 'beds'] = 1.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/582835905208247854', 'beds'] = 3.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/720080948267459603', 'beds'] = 2.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/799591397619258644', 'beds'] = 2.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/798136718669520279', 'beds'] = 2.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/840129849005783804', 'beds'] = 3.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/939039430629377525', 'beds'] = 1.0
data_v3.loc[data_v3['listing_url'] == 'https://www.airbnb.com/rooms/962738485338707503', 'beds'] = 2.0
For the remaining 5 values, sufficient information was not found, so I'm going to drop them and reset the index using .reset_index() method.
# Drop 5 rows from the 'beds' column
data_v3 = data_v3.dropna(subset=['beds']).reset_index(drop=True)
Let's check the missing values.
# Check for missing values
round(data_v3.isnull().sum())
listing_url 0 name 0 host_id 0 neighbourhood_cleansed 0 latitude 0 longitude 0 room_type 0 accommodates 0 bathrooms_text 0 bedrooms 0 beds 0 amenities 0 price 0 minimum_nights 0 number_of_reviews 0 dtype: int64
All missing values have now been handled.
# Check the percentage of missing values
round(data_v3.isnull().sum() / len(data_v3) * 100,2)
listing_url 0.0 name 0.0 host_id 0.0 neighbourhood_cleansed 0.0 latitude 0.0 longitude 0.0 room_type 0.0 accommodates 0.0 bathrooms_text 0.0 bedrooms 0.0 beds 0.0 amenities 0.0 price 0.0 minimum_nights 0.0 number_of_reviews 0.0 dtype: float64
Ensuring that your data is in the correct format before beginning the analysis is crucial as it helps prevent errors and enhances the accuracy and efficiency of your analysis. Therefore, I will use the .dtypes attribute to verify the data types of the columns. Upon inspection, we observe that the ‘bedrooms’, ‘beds’, and ‘price’ columns need formatting.
# Check data type
data_v3.dtypes
listing_url object name object host_id int64 neighbourhood_cleansed object latitude float64 longitude float64 room_type object accommodates int64 bathrooms_text object bedrooms float64 beds float64 amenities object price object minimum_nights int64 number_of_reviews int64 dtype: object
The ‘price’ column is currently an object data type, but it needs to be numeric. To achieve this, I will use a regular expression to replace any occurrence of a dollar sign ($) or comma (,) in the ‘price’ column with an empty string (''), effectively removing these characters. The .str attribute allows string methods to be applied to the Series object. Finally, I will use pd.to_numeric() to convert the column into a numeric data type.
# Convert the column 'price' to numeric
data_v3['price'] = pd.to_numeric(data_v3['price'].str.replace(r'[\$,]', '', regex=True))
Checking the 'price' column data type.
# Check 'price' data type
data_v3['price'].dtypes
dtype('float64')
The ‘bedrooms’ and ‘beds’ columns in the DataFrame are currently of float data type, which I plan to convert to integers. To do this, I will use the .astype() method with ‘int64’ as the argument, indicating that the values should be converted to 64-bit integers.
# Convert the columns 'bedrooms' and 'beds' to integers
data_v3[['bedrooms', 'beds']] = data_v3[['bedrooms', 'beds']].astype('int64')
Checking the 'bedrooms' and 'beds' column data type.
# Check 'bedrooms' and 'beds' data type
data_v3[['bedrooms', 'beds']].dtypes
bedrooms int64 beds int64 dtype: object
In this step, we aim to identify and correct any out-of-place or incorrect data in the DataFrame. I used the .value_counts() method to examine the values in the ‘neighbourhood_cleansed’ column. I found an outdated local government area name, ‘Moreland’, which has been renamed to ‘Merri-bek’.
# Check unique values in 'neighbourhood_cleansed' column
data_v3['neighbourhood_cleansed'].value_counts()
Melbourne 7279 Port Phillip 2742 Yarra 1776 Stonnington 1514 Yarra Ranges 1112 Moreland 942 Monash 590 Boroondara 589 Whitehorse 583 Wyndham 576 Glen Eira 558 Darebin 551 Maribyrnong 515 Bayside 401 Kingston 363 Moonee Valley 353 Hobsons Bay 267 Frankston 263 Brimbank 258 Casey 234 Manningham 227 Hume 207 Banyule 201 Cardinia 186 Knox 172 Greater Dandenong 172 Whittlesea 172 Melton 157 Maroondah 117 Nillumbik 103 Name: neighbourhood_cleansed, dtype: int64
To handle that, I'm going to filter the data where 'neighbourhood_cleansed' values are equal to 'Moreland', and finally replace them with 'Merri-bek'.
# Replace 'Moreland' values in the 'neighbourhood_cleansed' column with 'Merri-bek'
data_v3.loc[data_v3['neighbourhood_cleansed'] == 'Moreland', 'neighbourhood_cleansed'] = 'Merri-bek'
Looking for duplicates, I used the .duplicated().sum() method, but none were found.
# Check duplicate data
data_v3.duplicated().sum()
0
Outliers are data points in a DataFrame that significantly differ from others. By using the .describe() method to inspect quantitative data, I noticed some extreme values in the ‘beds’, ‘price’, and ‘minimum_nights’ columns. Therefore, I plan to examine these columns using a boxplot.
# Check quantitative data
data_v3.describe()
| host_id | latitude | longitude | accommodates | bedrooms | beds | price | minimum_nights | number_of_reviews | |
|---|---|---|---|---|---|---|---|---|---|
| count | 2.318000e+04 | 23180.000000 | 23180.000000 | 23180.000000 | 23180.000000 | 23180.000000 | 23180.000000 | 23180.000000 | 23180.000000 |
| mean | 1.607798e+08 | -37.828200 | 145.013867 | 3.659232 | 1.717990 | 2.120449 | 230.782442 | 5.305824 | 30.810224 |
| std | 1.585869e+08 | 0.074548 | 0.150693 | 2.357892 | 1.073017 | 1.755538 | 1333.728563 | 29.119214 | 60.811455 |
| min | 7.834000e+03 | -38.225690 | 144.518047 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 |
| 25% | 3.038696e+07 | -37.854085 | 144.956621 | 2.000000 | 1.000000 | 1.000000 | 95.000000 | 1.000000 | 1.000000 |
| 50% | 9.884258e+07 | -37.819435 | 144.977230 | 3.000000 | 1.000000 | 2.000000 | 149.000000 | 2.000000 | 7.000000 |
| 75% | 2.628817e+08 | -37.802090 | 145.022482 | 4.000000 | 2.000000 | 3.000000 | 226.000000 | 3.000000 | 32.000000 |
| max | 5.351276e+08 | -37.484800 | 145.844997 | 16.000000 | 14.000000 | 117.000000 | 104983.000000 | 1125.000000 | 987.000000 |
To create a boxplot, I used the box function from the plotly.express module (which is often imported as px), and stored that in the variable fig. Then, I used the .update_layout() function to customize the layout of the plot. Displaying the plot through .show() function, we can observe the presence of extreme values in the 'beds' column.
# Create the graph
fig = px.box(data_v3, x='beds')
fig.update_layout(
title='Boxplot analysis variable - beds',
title_x=0.5,
autosize=False,
width=800,
height=500)
# Show the boxplot
fig.show()
Observing the presence of extreme values in the boxplot above, caution is necessary as these extreme values may not necessarily be outliers. For further investigation, I will use the 'listing_url' column as a reference.
# Inspect the 'beds' column where values are higher or equal to 22 using the 'listing_url' to check the 'beds' values
data_v3[data_v3['beds'] >= 22].head(3)
| listing_url | name | host_id | neighbourhood_cleansed | latitude | longitude | room_type | accommodates | bathrooms_text | bedrooms | beds | amenities | price | minimum_nights | number_of_reviews | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11279 | https://www.airbnb.com/rooms/46266224 | Vacation home in Yarra Glen · 10 bedrooms · 22... | 107232439 | Yarra Ranges | -37.622600 | 145.372260 | Entire home/apt | 16 | 9 baths | 10 | 22 | ["Microwave", "BBQ grill", "Barbecue utensils"... | 2400.0 | 2 | 2 |
| 15169 | https://www.airbnb.com/rooms/690072715126723461 | casa particular in Melbourne · 3 bedrooms · 11... | 457710697 | Melbourne | -37.815070 | 144.962610 | Entire home/apt | 2 | 2.5 baths | 3 | 117 | ["Smoking allowed", "Window AC unit", "Pets al... | 19045.0 | 1 | 0 |
| 19381 | https://www.airbnb.com/rooms/857121112148324865 | Home in Deer Park · 3 bedrooms · 31 beds · 2 b... | 485621814 | Brimbank | -37.761819 | 144.772296 | Entire home/apt | 16 | 2 baths | 3 | 31 | ["Microwave", "Hot water kettle", "BBQ grill",... | 169.0 | 1 | 1 |
Using the .shape attribute revealed 463 extreme values. Hovewer, a detailed examination identified only one true outlier. Without this through investigation, we could have erroneously removed 462 valid rows.
# Filter 'beds' column with values above or equal to 7 and counting them
data_v3[data_v3['beds'] >= 7].shape
(463, 15)
Now, I will remove the data point in the 'beds' column that indicates 117 beds, as it's an outlier.
# Remove the outlier from the 'beds' column and reseting the DataFrame index
data_v3 = data_v3[data_v3['beds'] != 117].reset_index(drop=True)
As we can see from our boxplot, there are still extreme values. However, upon closer examination, these extreme values do not qualify as outliers, as I previously mentioned.
# Create the graph
fig = px.box(data_v3, x='beds')
fig.update_layout(
title='Boxplot analysis variable - beds',
title_x=0.5,
autosize=False,
width=800,
height=500)
# Show the boxplot
fig.show()
Checking quantitative data, we found values equal to 0 in 'price' column and that does not make sense. Therefore, I will check that.
# Check quantitative data
data_v3.describe()
| host_id | latitude | longitude | accommodates | bedrooms | beds | price | minimum_nights | number_of_reviews | |
|---|---|---|---|---|---|---|---|---|---|
| count | 2.317900e+04 | 23179.000000 | 23179.000000 | 23179.000000 | 23179.000000 | 23179.000000 | 23179.000000 | 23179.000000 | 23179.000000 |
| mean | 1.607670e+08 | -37.828201 | 145.013869 | 3.659304 | 1.717934 | 2.115492 | 229.970749 | 5.306010 | 30.811554 |
| std | 1.585783e+08 | 0.074549 | 0.150696 | 2.357918 | 1.073007 | 1.585129 | 1328.019560 | 29.119828 | 60.812430 |
| min | 7.834000e+03 | -38.225690 | 144.518047 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 |
| 25% | 3.038696e+07 | -37.854090 | 144.956621 | 2.000000 | 1.000000 | 1.000000 | 95.000000 | 1.000000 | 1.000000 |
| 50% | 9.883592e+07 | -37.819440 | 144.977230 | 3.000000 | 1.000000 | 2.000000 | 149.000000 | 2.000000 | 7.000000 |
| 75% | 2.628817e+08 | -37.802090 | 145.022485 | 4.000000 | 2.000000 | 3.000000 | 226.000000 | 3.000000 | 32.000000 |
| max | 5.351276e+08 | -37.484800 | 145.844997 | 16.000000 | 14.000000 | 31.000000 | 104983.000000 | 1125.000000 | 987.000000 |
In this column, I will filter the data to investigate the rows trying to uncover some relevant information.
# Filter 'price' column with values below $30
data_v3[(data_v3['price'] < 30)].head(15)
| listing_url | name | host_id | neighbourhood_cleansed | latitude | longitude | room_type | accommodates | bathrooms_text | bedrooms | beds | amenities | price | minimum_nights | number_of_reviews | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 367 | https://www.airbnb.com/rooms/3094430 | Home in Ormond · 1 bedroom · 1 bed · 1 bath | 15736128 | Glen Eira | -37.90021 | 145.03742 | Private room | 1 | 1 bath | 1 | 1 | ["Wifi", "Kitchen", "Washer", "First aid kit",... | 26.0 | 1 | 0 |
| 413 | https://www.airbnb.com/rooms/3194833 | Rental unit in Southbank · 1 bedroom · 2 beds ... | 1424387 | Port Phillip | -37.82871 | 144.95737 | Shared room | 2 | 1 shared bath | 1 | 2 | ["Pool", "Smoke alarm", "Wifi", "Kitchen", "Ho... | 24.0 | 1 | 1 |
| 727 | https://www.airbnb.com/rooms/4786567 | Rental unit in South Yarra · ★4.67 · 1 bedroom... | 24674881 | Melbourne | -37.83538 | 144.98543 | Private room | 1 | 1 bath | 1 | 1 | ["Smoke alarm", "Wifi", "Kitchen", "Dryer", "E... | 25.0 | 7 | 3 |
| 761 | https://www.airbnb.com/rooms/5196283 | Rental unit in Flemington · ★4.60 · 1 bedroom ... | 19752824 | Melbourne | -37.78868 | 144.92314 | Private room | 3 | 1 bath | 1 | 2 | ["Smoke alarm", "Wifi", "Dryer", "Pets allowed... | 20.0 | 3 | 5 |
| 1091 | https://www.airbnb.com/rooms/7207612 | Home in Northcote · 1 bedroom · 1 bed · 1 shar... | 37741609 | Darebin | -37.76533 | 145.00940 | Private room | 2 | 1 shared bath | 1 | 1 | ["Wifi", "Smoke alarm", "Kitchen", "Essentials... | 25.0 | 12 | 0 |
| 1118 | https://www.airbnb.com/rooms/7253120 | Home in Richmond · 1 bedroom · 1 bed · 1 bath | 27460296 | Yarra | -37.81993 | 144.99620 | Private room | 2 | 1 bath | 1 | 1 | ["Smoke alarm", "Wifi", "Kitchen", "Washer", "... | 28.0 | 3 | 2 |
| 1182 | https://www.airbnb.com/rooms/6855196 | Home in Camberwell · 1 bedroom · 1 bed · 1 bath | 35925193 | Boroondara | -37.84727 | 145.07037 | Private room | 2 | 1 bath | 1 | 1 | ["Smoking allowed", "Wifi", "Kitchen", "Washer... | 13.0 | 1 | 0 |
| 1228 | https://www.airbnb.com/rooms/7440879 | Home in Coburg · ★4.88 · 1 bedroom · 1 bed · 1... | 33559438 | Merri-bek | -37.73810 | 144.97693 | Shared room | 1 | 1.5 shared baths | 1 | 1 | ["Pool", "Smoke alarm", "Wifi", "Kitchen", "Wa... | 25.0 | 1 | 8 |
| 1300 | https://www.airbnb.com/rooms/7060419 | Home in Camberwell · 1 bedroom · 1 bed · 1 bath | 37024203 | Boroondara | -37.84818 | 145.09497 | Private room | 2 | 1 bath | 1 | 1 | ["Wifi", "Smoke alarm", "Kitchen", "Essentials... | 20.0 | 1 | 0 |
| 1434 | https://www.airbnb.com/rooms/8905856 | Home in Clifton Hill · 1 bedroom · 1 bed · 1 bath | 46582678 | Yarra | -37.78931 | 144.99854 | Private room | 2 | 1 bath | 1 | 1 | ["Wifi", "Kitchen", "Smoke alarm", "TV", "Free... | 26.0 | 1 | 0 |
| 1468 | https://www.airbnb.com/rooms/8375432 | Home in Carlton North · 1 bedroom · 1 bed · 1 ... | 44142350 | Yarra | -37.78067 | 144.97007 | Private room | 1 | 1 bath | 1 | 1 | ["Wifi", "Smoke alarm", "Kitchen", "Breakfast"... | 22.0 | 12 | 0 |
| 1476 | https://www.airbnb.com/rooms/8394057 | Home in Highett · 1 bedroom · 1 bed · 1.5 baths | 44229140 | Bayside | -37.94779 | 145.03006 | Private room | 1 | 1.5 baths | 1 | 1 | ["Wifi", "Smoke alarm", "Kitchen", "Washer", "... | 25.0 | 1 | 1 |
| 1539 | https://www.airbnb.com/rooms/8656155 | Rental unit in Thornbury · 1 bedroom · 1 bed ·... | 45489433 | Darebin | -37.75685 | 145.01064 | Private room | 1 | 1 bath | 1 | 1 | ["Wifi", "Kitchen", "Washer", "Essentials", "F... | 27.0 | 7 | 0 |
| 1822 | https://www.airbnb.com/rooms/10000335 | Place to stay in Clayton · 2 bedrooms · 1 bed ... | 51373610 | Monash | -37.91737 | 145.13953 | Private room | 1 | 3 shared baths | 2 | 1 | ["Smoke alarm", "Wifi", "Kitchen", "Gym", "Iro... | 26.0 | 3 | 0 |
| 1952 | https://www.airbnb.com/rooms/10572460 | Home in Altona Meadows · 1 bedroom · 1 bed · 1... | 51890204 | Hobsons Bay | -37.87756 | 144.76848 | Private room | 2 | 1 bath | 1 | 1 | ["Smoke alarm", "Wifi", "Kitchen", "Dryer", "I... | 26.0 | 3 | 2 |
Upon examining the 'price' column, we noticed that listings priced below $30 showed an update of price information. As we have lots of rows, I decided to implement the strategy to replace these values by the median using the .median() method.
# Replace the values below 30 in the 'price' column with the median of the 'price' column
data_v3.loc[data_v3['price'] < 30, 'price'] = data_v3['price'].median()
Now I'm going to look at the higher prices. To handle with that I will use the boxplot.
# Create the graph
fig = px.box(data_v3, x='price')
fig.update_layout(
title='Boxplot analysis variable - price',
title_x=0.5,
autosize=False,
width=800,
height=500)
# Show the boxplot
fig.show()
Through the boxplot we can see some higher values. So, I will analyze prices higher than $2000 to check if I'm going to consider them, otherwise, I'm going to remove them.
# Filter prices higher than $2000
data_v3[(data_v3['price'] > 2000)].head(10)
| listing_url | name | host_id | neighbourhood_cleansed | latitude | longitude | room_type | accommodates | bathrooms_text | bedrooms | beds | amenities | price | minimum_nights | number_of_reviews | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 78 | https://www.airbnb.com/rooms/1353588 | Townhouse in Richmond · ★4.79 · Studio · 2 bed... | 6362922 | Yarra | -37.82157 | 144.99694 | Private room | 1 | 0 shared baths | 0 | 2 | ["Wifi", "Kitchen", "Iron", "Hangers", "Breakf... | 20000.0 | 2 | 95 |
| 705 | https://www.airbnb.com/rooms/4767182 | Home in Malvern · 5 bedrooms · 5 beds · 4 baths | 11914644 | Stonnington | -37.84542 | 145.03280 | Entire home/apt | 10 | 4 baths | 5 | 5 | ["Microwave", "Coffee maker", "Host greets you... | 3304.0 | 1 | 0 |
| 1246 | https://www.airbnb.com/rooms/7490588 | Rental unit in Melbourne · ★4.49 · 2 bedrooms ... | 27519695 | Melbourne | -37.81474 | 144.95128 | Entire home/apt | 8 | 1 bath | 2 | 2 | ["Microwave", "Carbon monoxide alarm", "Lockbo... | 6825.0 | 400 | 400 |
| 1885 | https://www.airbnb.com/rooms/10476673 | Home in Heatherton · 3 bedrooms · 3 beds · 2.5... | 31713704 | Kingston | -37.95436 | 145.08372 | Entire home/apt | 6 | 2.5 baths | 3 | 3 | ["Pool", "Smoke alarm", "Wifi", "Kitchen", "Dr... | 2243.0 | 7 | 0 |
| 2347 | https://www.airbnb.com/rooms/12100235 | Home in Healesville · ★4.90 · 10 bedrooms · 13... | 64874108 | Yarra Ranges | -37.65557 | 145.51727 | Entire home/apt | 16 | 9.5 baths | 10 | 13 | ["Microwave", "Hot water kettle", "BBQ grill",... | 2127.0 | 2 | 49 |
| 2461 | https://www.airbnb.com/rooms/12719024 | Home in Reservoir · 2 bedrooms · 2 beds · 1 bath | 69151339 | Darebin | -37.72707 | 144.99738 | Entire home/apt | 4 | 1 bath | 2 | 2 | ["Smoking allowed", "Smoke alarm", "Wifi", "Ki... | 12624.0 | 10 | 1 |
| 2465 | https://www.airbnb.com/rooms/12736588 | Home in Thornbury · 1 bedroom · 1 bed · 1.5 baths | 47493458 | Darebin | -37.75935 | 145.00777 | Private room | 2 | 1.5 baths | 1 | 1 | ["Smoking allowed", "Smoke alarm", "Wifi", "Ki... | 2500.0 | 3 | 0 |
| 2999 | https://www.airbnb.com/rooms/14919472 | Rental unit in Southbank · ★5.0 · 4 bedrooms ·... | 29183827 | Melbourne | -37.82225 | 144.96365 | Entire home/apt | 6 | 3.5 baths | 4 | 5 | ["Smoke alarm", "Wifi", "Kitchen", "Iron", "Ca... | 2057.0 | 7 | 4 |
| 3491 | https://www.airbnb.com/rooms/16260007 | Home in Reservoir · ★4.0 · 1 bedroom · 1 bed ·... | 96127123 | Darebin | -37.69922 | 145.01132 | Private room | 2 | 1 bath | 1 | 1 | ["Smoking allowed", "Smoke alarm", "Wifi", "Ki... | 13379.0 | 1 | 3 |
| 3777 | https://www.airbnb.com/rooms/18043096 | Rental unit in Saint Kilda · ★4.58 · 2 bedroom... | 105279173 | Port Phillip | -37.86593 | 144.98351 | Private room | 2 | 1 shared bath | 2 | 1 | ["Smoke alarm", "Wifi", "Kitchen", "Bed linens... | 5500.0 | 14 | 12 |
After analyzing the 'price' column, I found that the marjority of listings priced above $2000 lack information, appear to be duplicates, have no reviews, are outdated, and are not availaible for check in. Therefore, I'm considering removing all listings priced above $2000.
# Remove values above to 2000
data_v3 = data_v3[data_v3['price'] <= 2000].reset_index(drop=True)
The boxplot shows many extreme values. However, these were not considered outliers because they represent valid data points.
# Create the graph
fig = px.box(data_v3, x='price')
fig.update_layout(
title='Boxplot analysis variable - price',
title_x=0.5,
autosize=False,
width=800,
height=500
)
# Show the boxplot
fig.show()
Using the .describe(), I have observed an unusually high value in the 'minimum_nights' column. Let's investigate that.
# Check quantitative data
data_v3.describe()
| host_id | latitude | longitude | accommodates | bedrooms | beds | price | minimum_nights | number_of_reviews | |
|---|---|---|---|---|---|---|---|---|---|
| count | 2.309400e+04 | 23094.000000 | 23094.000000 | 23094.000000 | 23094.000000 | 23094.000000 | 23094.000000 | 23094.000000 | 23094.000000 |
| mean | 1.608007e+08 | -37.828205 | 145.013783 | 3.649952 | 1.712999 | 2.109725 | 197.130423 | 5.235472 | 30.866372 |
| std | 1.586302e+08 | 0.074545 | 0.150722 | 2.346839 | 1.064941 | 1.572811 | 187.706796 | 28.747963 | 60.825704 |
| min | 7.834000e+03 | -38.225690 | 144.518047 | 1.000000 | 0.000000 | 1.000000 | 30.000000 | 1.000000 | 0.000000 |
| 25% | 3.039534e+07 | -37.854127 | 144.956592 | 2.000000 | 1.000000 | 1.000000 | 99.000000 | 1.000000 | 1.000000 |
| 50% | 9.881559e+07 | -37.819410 | 144.977188 | 3.000000 | 1.000000 | 2.000000 | 149.000000 | 2.000000 | 7.000000 |
| 75% | 2.628817e+08 | -37.802066 | 145.022396 | 4.000000 | 2.000000 | 3.000000 | 225.000000 | 3.000000 | 32.000000 |
| max | 5.351276e+08 | -37.484800 | 145.844997 | 16.000000 | 14.000000 | 31.000000 | 2000.000000 | 1125.000000 | 987.000000 |
Investigating through the boxplot we find some extreme values.
# Create the graph
fig = px.box(data_v3, x='minimum_nights')
fig.update_layout(
title='Boxplot analysis variable - minimum_nights',
title_x=0.5,
autosize=False,
width=800,
height=500
)
# Show the boxplot
fig.show()
Here, I'm going to filter the data where minimum_nights are more than 365.
# Filter 'minimum_nights' column with values above 365
data_v3[data_v3['minimum_nights'] > 365].head(10)
| listing_url | name | host_id | neighbourhood_cleansed | latitude | longitude | room_type | accommodates | bathrooms_text | bedrooms | beds | amenities | price | minimum_nights | number_of_reviews | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 110 | https://www.airbnb.com/rooms/900838 | Loft in Richmond · ★4.88 · 1 bedroom · 1 bed ·... | 4812182 | Yarra | -37.82286 | 144.99065 | Entire home/apt | 2 | 1 bath | 1 | 1 | ["Smoke alarm", "Wifi", "Kitchen", "Dryer", "I... | 232.0 | 803 | 141 |
| 590 | https://www.airbnb.com/rooms/4632666 | Rental unit in Melbourne · ★4.62 · 2 bedrooms ... | 4924814 | Melbourne | -37.81572 | 144.95447 | Entire home/apt | 5 | 1 bath | 2 | 3 | ["Microwave", "BBQ grill", "Free street parkin... | 149.0 | 700 | 165 |
| 650 | https://www.airbnb.com/rooms/4844270 | Rental unit in South Yarra · 1 bedroom · 1 bed... | 21300152 | Melbourne | -37.82967 | 144.98577 | Entire home/apt | 2 | 1 bath | 1 | 1 | ["Wifi", "Kitchen", "Iron", "Hangers", "Essent... | 125.0 | 500 | 1 |
| 1974 | https://www.airbnb.com/rooms/10770086 | Rental unit in Travancore · 1 bedroom · 1 bed ... | 54793045 | Melbourne | -37.78452 | 144.93970 | Private room | 2 | 1 bath | 1 | 1 | ["Smoke alarm", "Wifi", "Kitchen", "Gym", "Iro... | 34.0 | 900 | 1 |
| 2922 | https://www.airbnb.com/rooms/14833595 | Rental unit in Melbourne · ★4.69 · 3 bedrooms ... | 4924814 | Melbourne | -37.81624 | 144.95495 | Entire home/apt | 9 | 2 baths | 3 | 5 | ["Microwave", "Free street parking", "High cha... | 269.0 | 600 | 125 |
| 2950 | https://www.airbnb.com/rooms/14581093 | Rental unit in Southbank · ★4.78 · 2 bedrooms ... | 4924814 | Melbourne | -37.82821 | 144.95896 | Entire home/apt | 6 | 2 baths | 2 | 3 | ["Free street parking", "High chair", "Hot wat... | 269.0 | 1124 | 75 |
| 2971 | https://www.airbnb.com/rooms/14893028 | Rental unit in Melbourne · ★4.76 · 3 bedrooms ... | 4924814 | Melbourne | -37.81532 | 144.95331 | Entire home/apt | 8 | 2 baths | 3 | 5 | ["Microwave", "Free street parking", "Lockbox"... | 299.0 | 600 | 137 |
| 4570 | https://www.airbnb.com/rooms/21035543 | Home in Wantirna · 1 bedroom · 1 bed · 1 priva... | 42784943 | Knox | -37.84083 | 145.22844 | Private room | 2 | 1 private bath | 1 | 1 | ["Smoking allowed", "Smoke alarm", "Wifi", "Ki... | 40.0 | 1000 | 1 |
| 4669 | https://www.airbnb.com/rooms/21194478 | Rental unit in Southbank · ★4.70 · 3 bedrooms ... | 4924814 | Melbourne | -37.82727 | 144.96079 | Entire home/apt | 9 | 2 baths | 3 | 5 | ["Microwave", "Free street parking", "High cha... | 209.0 | 700 | 55 |
| 4968 | https://www.airbnb.com/rooms/21792086 | Rental unit in Melbourne · ★4.30 · 1 bedroom ·... | 156159929 | Melbourne | -37.80936 | 144.96001 | Private room | 2 | 1 shared bath | 1 | 1 | ["Pool", "Wifi", "Kitchen", "Hangers", "Essent... | 55.0 | 1000 | 52 |
After inspecting 'minimum_nights' values above 365, I discovered that most are outdated, some appears to be duplicates, and others are not available for check in. There's only one value that appears valid but with outdated information. I will update this value and remove the rest.
# Update the 'minimum_nights' value equal to 999 with 2, removing the values above 365 and reseting the DataFrame index
data_v3.loc[(data_v3['minimum_nights'] == 999), 'minimum_nights'] = 2
data_v3 = data_v3.loc[(data_v3['minimum_nights'] <= 365)].reset_index(drop=True)
Once again, the boxplot reveals a significant number of extreme values. Despite their extremity, these were not categorized as outliers.
# Create the graph
fig = px.box(data_v3, x='minimum_nights')
fig.update_layout(
title='Boxplot analysis variable - minimum_nights',
title_x=0.5,
autosize=False,
width=800,
height=500
)
# Show the boxplot
fig.show()
As the 'name' and 'listing_url' columns don't provide more significant information, we'll exclude them using the .drop() method with the 'columns' argument, specifying the columns to be removed.
# Drop the 'name' and 'listing_url' columns
data_v3 = data_v3.drop(columns=['name', 'listing_url'])
Now I'm going to make another DataFrame backup.
# Make DataFrame backup
data_v4 = data_v3.copy()
I will export the DataFrame to a CSV file using the .to_csv() method. It takes two arguments: the path where the file will be saved and a boolean value to avoid writing row indices. I will use this CSV file to create a Tableau dashboard.
# Export DataFrame
data_v4.to_csv('listings_cleaned.csv', index=False)
In EDA (Exploratory Data Analysis), I'm going to summarize, visualize, and interpret data, which can help in identifying patterns and trends, so that we can find insights that answer our questions. In this stage, I'm going to perform Univariate, Bivariate and Correlation Analysis.
Univariate Analysis involves studying a single data variable. This analysis can help us understand the data's characteristics, such as the distribution, central tendency, and dispersion. It's often the first step in data analysis, providing valuable insights before moving on to bivariate analysis. So, let's answer our questions!
# Create a frequency table for the top 10 hosts
top_10_hosts = pd.DataFrame({
'Host ID': data_v4['host_id'].value_counts().head(10).index,
'Number of Listings': data_v4['host_id'].value_counts().head(10).values,
'Percentage of Listings': np.round((data_v4['host_id'].value_counts().head(10) / data_v4.shape[0]) * 100, 2)
})
# Reset the index to get a standard numeric index
top_10_hosts.reset_index(drop=True, inplace=True)
# Adjust the index to start at 1 instead of 0
top_10_hosts.index = top_10_hosts.index + 1
# Display the frequency table
top_10_hosts
| Host ID | Number of Listings | Percentage of Listings | |
|---|---|---|---|
| 1 | 90729398 | 327 | 1.42 |
| 2 | 279001183 | 250 | 1.08 |
| 3 | 343442154 | 154 | 0.67 |
| 4 | 1739996 | 98 | 0.42 |
| 5 | 412803321 | 80 | 0.35 |
| 6 | 22860147 | 80 | 0.35 |
| 7 | 446080599 | 77 | 0.33 |
| 8 | 760849 | 73 | 0.32 |
| 9 | 392306676 | 69 | 0.30 |
| 10 | 10190138 | 66 | 0.29 |
The data reveals that the top 10 hosts with the most listings are led by host 90729398 with 327 listings (1.42%) of total. The next two hosts, 279001183 and 343442154, have 250 (1.08%) and 154 (0.67%) listings respectively. The remaining seven hosts each have less than 100 listings, contributing less than 0.5% to the total each. For a clearer view of the results, I've created the following chart.
# Filter the top 10 hosts
top_10_hosts = data_v4['host_id'].astype(str).value_counts().head(10).sort_values()
# Create the graph
fig = go.Figure(data=[go.Bar(
x=top_10_hosts.values, y=top_10_hosts.index,
orientation='h', text=top_10_hosts.values, textposition='auto',
hovertemplate='Host ID: %{y}<br>Number of Listings: %{x}<extra></extra>',
marker_color='#f5b7b9'
)])
# Customize the layout
fig.update_layout(
autosize=False, title='<b>Top 10 Hosts Listings</b>', title_x=0.5,
width=800, height=500,
hoverlabel=dict(bgcolor="white", font_size=16, font_family="Rockwell")
)
# Add the source
fig.add_annotation(
text='<b>Source: Inside Airbnb</b>', align='right', showarrow=False,
xref='paper', yref='paper', x=1, y=-0.2,
bgcolor='white', font=dict(size=12, color='black')
)
# Show the chart
fig.show()
# Create a frequency table for the top 10 neighbourhoods
top_10_neighbourhoods = pd.DataFrame({
'Neighbourhood': data_v4['neighbourhood_cleansed'].value_counts().head(10).index,
'Number of Listings': data_v4['neighbourhood_cleansed'].value_counts().head(10).values,
'Percentage of Listings': np.round((data_v4['neighbourhood_cleansed'].value_counts().head(10) / data_v4.shape[0]) * 100, 2)
})
# Reset the index to get a standard numeric index
top_10_neighbourhoods.reset_index(drop=True, inplace=True)
# Adjust the index to start at 1 instead of 0
top_10_neighbourhoods.index = top_10_neighbourhoods.index + 1
# Display the frequency table
top_10_neighbourhoods
| Neighbourhood | Number of Listings | Percentage of Listings | |
|---|---|---|---|
| 1 | Melbourne | 7246 | 31.40 |
| 2 | Port Phillip | 2723 | 11.80 |
| 3 | Yarra | 1771 | 7.67 |
| 4 | Stonnington | 1501 | 6.50 |
| 5 | Yarra Ranges | 1106 | 4.79 |
| 6 | Merri-bek | 940 | 4.07 |
| 7 | Monash | 590 | 2.56 |
| 8 | Boroondara | 586 | 2.54 |
| 9 | Whitehorse | 582 | 2.52 |
| 10 | Wyndham | 575 | 2.49 |
The data reveals that the majority of Airbnb listings in Melbourne are concentrated in the Melbourne neighbourhood itself, accounting for 31.40% of the total listings. The next two neighbourhoods, Port Phillip and Yarra, have 11.80% and 7.67% of the listings respectively. The remaining seven neighbourhoods each have less than 7% of the total listings, indicating a more dispersed distribution of Airbnb listings in these areas.
# Filter the top 10 neighbourhoods
top_10_neighbourhoods = data_v4['neighbourhood_cleansed'].value_counts().head(10).sort_values()
# Create the graph
fig = go.Figure(data=[go.Bar(
x=top_10_neighbourhoods.values, y=top_10_neighbourhoods.index,
orientation='h', text=top_10_neighbourhoods.values, textposition='auto',
hovertemplate='Neighbourhood: %{y}<br>Number of Listings: %{x}<extra></extra>',
marker_color='#f5b7b9'
)])
# Customize the layout
fig.update_layout(
autosize=False, title='<b>Top 10 Neighbourhoods with Most Listings</b>', title_x=0.5,
width=800, height=500,
hoverlabel=dict(bgcolor="white", font_size=16, font_family="Rockwell")
)
# Add the source
fig.add_annotation(
text='<b>Source: Inside Airbnb</b>', align='right', showarrow=False,
xref='paper', yref='paper', x=1, y=-0.2,
bgcolor='white', font=dict(size=12, color='black')
)
# Show the chart
fig.show()
Here's a graph map displaying the results.
# Load the GeoJSON file
gdf = gpd.read_file('/kaggle/input/melbourne-airbnb-september-2023-dataset/neighbourhoods.geojson')
# Count the number of listings and add to the GeoDataFrame
gdf['listings'] = gdf['neighbourhood'].map(data_v4['neighbourhood_cleansed'].value_counts())
# Filter the 10 neighbourhoods with the most listings
top_10_neighbourhoods = gdf.nlargest(10, 'listings')
# Create a new column with the neighbourhood names
top_10_neighbourhoods['Neighbourhood'] = top_10_neighbourhoods['neighbourhood']
# Create the interactive map plot
fig = px.choropleth_mapbox(top_10_neighbourhoods, geojson=gdf.geometry,
locations=top_10_neighbourhoods.index, color='listings',
hover_data={'Neighbourhood':True, 'listings':True},
color_continuous_scale=[(0, '#FFFFFF'), (1, '#ff4c5e')], # Custom colorscale
mapbox_style="open-street-map",
zoom=10, center = {"lat": -37.8136, "lon": 144.9631}, opacity=0.5,
labels={'listings':'Number of Listings'})
# Customize the layout
fig.update_layout(
autosize=False, width=800, height=500, margin={"r":0,"t":0,"l":0,"b":0},
hoverlabel=dict(bgcolor="white", font_size=16, font_family="Rockwell"
))
# Update the hover template
fig.update_traces(hovertemplate='<b>Neighbourhood: %{customdata[0]}</b><br><b>Number of Listings: %{z}</b>')
# Show the chart
fig.show()
# Create a frequency table for room types
room_type = pd.DataFrame({
'Room Type': data_v4['room_type'].value_counts().head().index,
'Number of Listings': data_v4['room_type'].value_counts().head().values,
'Percentage of Listings': np.round((data_v4['room_type'].value_counts().head() / data_v4.shape[0]) * 100, 2)
})
# Reset the index to get a standard numeric index
room_type.reset_index(drop=True, inplace=True)
# Adjust the index to start at 1 instead of 0
room_type.index = room_type.index + 1
# Display the frequency table
room_type
| Room Type | Number of Listings | Percentage of Listings | |
|---|---|---|---|
| 1 | Entire home/apt | 16803 | 72.80 |
| 2 | Private room | 5911 | 25.61 |
| 3 | Shared room | 277 | 1.20 |
| 4 | Hotel room | 89 | 0.39 |
The room type in the listings are distributed in 'Entire home/apt' with 16.803 listings, accounting for (72.80%) of the total. This is followed by 'Private room' with 5.911 listings (25.61%), 'Shared room' with 277 listings (1.20%), and 'Hotel room' with 89 listings (0.39%).
# Filter room types and calculate their percentage
room_type = (data_v4['room_type'].value_counts() / data_v4.shape[0] * 100).round(2)
# Create the graph
fig = go.Figure(data=[go.Bar(
x=room_type.index, y=room_type.values,
text=room_type.values, textposition='auto',
hovertemplate='Room type: %{x}<br>Number of Listings: %{text}%<extra></extra>',
marker_color='#f5b7b9'
)])
# Customize the layout
fig.update_layout(
autosize=False, title='<b>Room Type Distribution</b>', title_x=0.5,
width=800, height=500,
hoverlabel=dict(bgcolor="white", font_size=16, font_family="Rockwell")
)
# Add the source
fig.add_annotation(
text='<b>Source: Inside Airbnb</b>', align='right', showarrow=False,
xref='paper', yref='paper', x=1, y=-0.2,
bgcolor='white', font=dict(size=12, color='black')
)
# Show the chart
fig.show()
# Create a frequency table for the top 5 bedroom counts
top_5_bedrooms = pd.DataFrame({
'Number of Bedroom': data_v4['bedrooms'].value_counts().head().index,
'Number of Listings': data_v4['bedrooms'].value_counts().head().values,
'Percentage of Listings': np.round((data_v4['bedrooms'].value_counts().head() / data_v4.shape[0]) * 100, 2)
})
# Reset the index to get a standard numeric index
top_5_bedrooms.reset_index(drop=True, inplace=True)
# Adjust the index to start at 1 instead of 0
top_5_bedrooms.index = top_5_bedrooms.index + 1
# Display the frequency table
top_5_bedrooms
| Number of Bedroom | Number of Listings | Percentage of Listings | |
|---|---|---|---|
| 1 | 1 | 11386 | 49.33 |
| 2 | 2 | 6926 | 30.01 |
| 3 | 3 | 2527 | 10.95 |
| 4 | 4 | 1051 | 4.55 |
| 5 | 0 | 748 | 3.24 |
The data reveals that the majority of listings have one bedroom, accounting for nearly half of the total with 11,386 listings, or (49.33%). Two-bedroom listings are the next most common, making up (30.01%) with 6,926 of the listings. As we move to larger properties, the numbers decrease, with 3-bedroom and 4-bedroom making up (10.95%) with 2,527 listings and (4.55%) with 1,051 listings respectively. The least common are studios, or 0-bedroom properties, which make up (3.24%) with 748 listings.
# Filter bedrooms booked and calculate their percentage
bedrooms_booked = (data_v4['bedrooms'].value_counts().head() / data_v4.shape[0] * 100).round(2)
# Create the graph
fig = go.Figure(data=[go.Bar(
x=bedrooms_booked.index.astype(str), y=bedrooms_booked.values,
text=bedrooms_booked.values, textposition='auto',
hovertemplate='Number of Bedroom: %{x}<br>Number of Listings: %{text}%<extra></extra>',
marker_color='#f5b7b9'
)])
# Customize the layout
fig.update_layout(
autosize=False, title='<b>The 5 Most Common Bedroom</b>', title_x=0.5,
width=800, height=500,
hoverlabel=dict(bgcolor="white", font_size=16, font_family="Rockwell")
)
# Add an annotation for clarity
fig.add_annotation(
text='<b>Note: Bedroom = 0 (Studio)</b>', align='right', showarrow=False,
xref='paper', yref='paper',
x=1, y=-0.14,
bgcolor='white', font=dict(size=12, color='black')
)
# Add the source
fig.add_annotation(
text='<b>Source: Inside Airbnb</b>', align='right', showarrow=False,
xref='paper', yref='paper', x=1, y=-0.2,
bgcolor='white', font=dict(size=12, color='black')
)
# Show the chart
fig.show()
# Create a frequency table for the most common minimum nights
most_minimum_nights = pd.DataFrame({
'Minimum Nights': data_v4['minimum_nights'].value_counts().head().index,
'Number of Listings': data_v4['minimum_nights'].value_counts().head().values,
'Percentage of Listings': np.round((data_v4['minimum_nights'].value_counts().head() / data_v4.shape[0]) * 100, 2)
})
# Reset the index to get a standard numeric index
most_minimum_nights.reset_index(drop=True, inplace=True)
# Adjust the index to start at 1 instead of 0
most_minimum_nights.index = most_minimum_nights.index + 1
# Display the frequency table
most_minimum_nights
| Minimum Nights | Number of Listings | Percentage of Listings | |
|---|---|---|---|
| 1 | 1 | 7895 | 34.21 |
| 2 | 2 | 7325 | 31.74 |
| 3 | 3 | 3008 | 13.03 |
| 4 | 7 | 1257 | 5.45 |
| 5 | 5 | 1021 | 4.42 |
The five most frequent minimum nights are 1, 2, and 3 which together account for 78.98% of the total listings. This suggest that most hosts accommodates short-term stays and cater to travelers who are looking for flexibility and convinience. On the other hand, listings with minimum nights of 5 or more are relatively rare. This shows that hosts who prefer long-term stays are less common.
# Create a frequency table for the most common minimum nights
most_minimum_nights = pd.DataFrame({
'Minimum Nights': data_v4['minimum_nights'].value_counts().head().index,
'Number of Listings': data_v4['minimum_nights'].value_counts().head().values,
'Percentage of Listings': np.round((data_v4['minimum_nights'].value_counts().head() / data_v4.shape[0]) * 100, 2)
})
# Reset the index to get a standard numeric index
most_minimum_nights.reset_index(drop=True, inplace=True)
# Adjust the index to start at 1 instead of 0
most_minimum_nights.index = most_minimum_nights.index + 1
# Create the graph
fig = go.Figure(data=[go.Bar(
x=most_minimum_nights['Minimum Nights'].astype(str), y=most_minimum_nights['Number of Listings'],
text=most_minimum_nights['Number of Listings'], textposition='auto',
hovertemplate='Minimum Nights: %{x}<br>Number of Listings: %{y}<extra></extra>',
marker_color='#f5b7b9'
)])
# Customize the layout
fig.update_layout(
autosize=False, title='<b>The 5 Most Minimum Nights</b>', title_x=0.5,
width=800, height=500,
hoverlabel=dict(bgcolor="white", font_size=16, font_family="Rockwell")
)
# Add the source
fig.add_annotation(
text='<b>Source: Inside Airbnb</b>', align='right', showarrow=False,
xref='paper', yref='paper', x=1, y=-0.2,
bgcolor='white', font=dict(size=12, color='black')
)
# Show the chart
fig.show()
# Transform the string series into a list series and counting the total of each amenity
amenities = [item for sublist in data_v4['amenities'].apply(ast.literal_eval) for item in sublist]
quantity_amenities = pd.Series(amenities).value_counts().head(4)
# Create the frequency table
quantity_amenities = pd.DataFrame({
'Amenity': quantity_amenities.index,
'Number of Amenities': quantity_amenities.values,
'Percentage of Amenities': np.round((quantity_amenities.values / data_v4.shape[0]) * 100, 2)
})
# Reset the index to get a standard numeric index
quantity_amenities.reset_index(drop=True, inplace=True)
# Adjust the index to start at 1
quantity_amenities.index = quantity_amenities.index + 1
# Display the frequency table
quantity_amenities
| Amenity | Number of Amenities | Percentage of Amenities | |
|---|---|---|---|
| 1 | Kitchen | 21865 | 94.74 |
| 2 | Smoke alarm | 21605 | 93.61 |
| 3 | Wifi | 20307 | 87.99 |
| 4 | Essentials | 20017 | 86.73 |
The data shows that the most popular amenities are those that are essential for a comfortable and safe stay, such as kitchen, smoke alarm, wifi, and essentials. These amenities are provided by more than 85% of the listings, indicating that they are highly valued by both hosts and guests.
# Sort the DataFrame in ascending order by 'Number of Amenities'
quantity_amenities = quantity_amenities.sort_values('Number of Amenities', ascending=True)
# Create the bar chart
fig = go.Figure(go.Bar(
x=quantity_amenities['Number of Amenities'], y=quantity_amenities['Amenity'], orientation='h',
text=quantity_amenities['Number of Amenities'], textposition='auto',
hovertemplate='Amenities: %{y}<br>Number of Amenities: %{x}<extra></extra>', marker_color='#f5b7b9'
))
# Customize the layout
fig.update_layout(
autosize=False, title_text='<b>4 Amenities Most Commonly Provided</b>', title_x=0.5,
width=800, height=500, hoverlabel=dict(bgcolor="white", font_size=16, font_family="Rockwell"
))
# Add the source
fig.add_annotation(
text='<b>Source: Inside Airbnb</b>', align='right', showarrow=False, xref='paper', yref='paper',
x=1, y=-0.2, bgcolor='white', font=dict(size=12, color='black'
))
# Show the chart
fig.show()
Bivariate analysis is a statistical method that helps you investigate relationships bewtween two different variables. This analysis can provide valuable insights into how these variables interact with each other.
# Create a frequency table for room types and their reviews
room_type_reviews = pd.DataFrame({
'Room Type': data_v4['room_type'].value_counts().index,
'Number of Reviews': data_v4.groupby('room_type')['number_of_reviews'].sum(),
'Percentage of Reviews': np.round(data_v4.groupby('room_type')['number_of_reviews'].sum() / data_v4['number_of_reviews'].sum() * 100, 2)
})
# Sort the values in the frequency table by 'Number of Reviews'
room_type_reviews = room_type_reviews.nlargest(4, 'Number of Reviews').reset_index(drop=True)
# Display the frequency table
room_type_reviews
| Room Type | Number of Reviews | Percentage of Reviews | |
|---|---|---|---|
| 0 | Entire home/apt | 613773 | 86.20 |
| 1 | Shared room | 94852 | 13.32 |
| 2 | Private room | 1765 | 0.25 |
| 3 | Hotel room | 1674 | 0.24 |
The data shows that the room type Entire home/apt has the highest number of reviews, accounting for 86.20% of the total reviews. This suggests that this room type is the most popular and preferred among guests, as they are likely to leave more feedback when they are satisfied with their stay.
# Create the graph
fig = go.Figure(data=[go.Bar(x=room_type_reviews['Room Type'], y=room_type_reviews['Number of Reviews'],
text=room_type_reviews['Number of Reviews'], textposition='auto',
hovertemplate='Room Type: %{x}<br>Number of Reviews: %{y}<extra></extra>', marker_color='#f5b7b9'
)])
# Customize the layout
fig.update_layout(title='<b>Room Type with the Highest Reviews</b>', title_x=0.5,
width=800, height=500,
hoverlabel=dict(bgcolor="white", font_size=16, font_family="Rockwell"
))
# Add the source
fig.add_annotation(text='<b>Source: Inside Airbnb</b>', align='right', showarrow=False, xref='paper', yref='paper',
x=1, y=-0.2, bgcolor='white', font=dict(size=12, color='black'
))
# Show the chart
fig.show()
# Group by room type and calculating the average number of amenities
room_type_amenities = data_v4['room_type'].value_counts().index.to_series().map(lambda x: data_v4[data_v4['room_type'] == x]['amenities'].apply(lambda y: len(ast.literal_eval(y))).mean())
# Create a DataFrame
room_type_amenities = pd.DataFrame({'Room Type': room_type_amenities.index,
'Average Amenities': room_type_amenities.values}).round(2).sort_values('Average Amenities', ascending=False)
# Reset the index to start from 1
room_type_amenities.reset_index(drop=True, inplace=True)
# Adjust the index to start at 1
room_type_amenities.index = room_type_amenities.index + 1
# Display the DataFrame
room_type_amenities
| Room Type | Average Amenities | |
|---|---|---|
| 1 | Entire home/apt | 35.14 |
| 2 | Private room | 24.00 |
| 3 | Hotel room | 23.89 |
| 4 | Shared room | 18.20 |
These averages indicate a noticeable variation in the number of amenities across different room types. 'Entire home/apt' listings provide the most amenities on average 35.15, while 'Shared room' provide the fewest 18.20. This trend aligns with our hypothesis that the average number of amenities provided varies by room type.
# Create the graph
fig = go.Figure(data=[go.Bar(
x=room_type_amenities['Room Type'], y=room_type_amenities['Average Amenities'],
text=room_type_amenities['Average Amenities'], textposition='auto', hovertemplate='Room Type: %{x}<br>Average Amenities: %{y}<extra></extra>',
marker_color='#f5b7b9'
)])
# Customize the layout
fig.update_layout(
title='<b>Average Amenities by Room Type</b>', title_x=0.5, width=800, height=500,
hoverlabel=dict(bgcolor="white", font_size=16, font_family="Rockwell"
))
# Add the source
fig.add_annotation(
text='<b>Source: Inside Airbnb</b>', align='right', showarrow=False, xref='paper', yref='paper', x=1, y=-0.2,
bgcolor='white', font=dict(size=12, color='black'
))
# Show the chart
fig.show()
# Create a table with the average for each room type
average_room_type = pd.DataFrame(
round(data_v4[data_v4['room_type'].isin(data_v4['room_type'].value_counts().head(5).index)]
.groupby('room_type')['price'].mean(),2)).reset_index().rename(columns={'room_type': 'Room Type', 'price': 'Average Price'}).sort_values('Average Price', ascending=True)
# Reset the index to start from 1
average_room_type = average_room_type.reset_index(drop=True)
# Adjust the index to start at 1
average_room_type.index = average_room_type.index + 1
# Show the table
average_room_type
| Room Type | Average Price | |
|---|---|---|
| 1 | Private room | 100.22 |
| 2 | Shared room | 117.17 |
| 3 | Entire home/apt | 232.23 |
| 4 | Hotel room | 243.19 |
The hypothesis predicts that the average price per night for private rooms would be higher than for shared rooms. However, the data shows the average price for a private room $100.22 is actually lower than for a shared room $117.17. Therefore, this specific set of data does not support this hypothesis.
# Create the graph
fig = go.Figure(data=[go.Bar(
x=average_room_type['Room Type'], y=average_room_type['Average Price'],
text=average_room_type['Average Price'], textposition='auto',
hovertemplate='Room Type: %{x}<br>Average Price: %{y}<extra></extra>', marker_color='#f5b7b9'
)])
# Customize the layout
fig.update_layout(
title='<b>Average Prices for each Room Type</b>', title_x=0.5,
width=800, height=500,
hoverlabel=dict(bgcolor="white", font_size=16, font_family="Rockwell"
))
# Add the source
fig.add_annotation(
text='<b>Source: Inside Airbnb</b>', align='right', showarrow=False, xref='paper', yref='paper',
x=1, y=-0.2, bgcolor='white', font=dict(size=12, color='black'
))
# Show the chart
fig.show()
Correlation analysis is a specific type of bivariate analysis that measures the strength and direction of the linear relationship between two variables. It results in a correlation coefficient, a value between -1 and 1. A correlation coefficient close to 1 indicates a strong positive relationship, a coefficient close to -1 indicates a strong negative relationship, and a coefficient close to 0 indicates no linear relationship.
To perform this analysis, first I will select quantitative variables using the .select_dtypes() method. This method will include int64 and float64 data types. Then, I will use the .iloc method to select only the variables of interest.
# Select quantitative variables (without latitude and longitude)
quantitative_variables = data_v4.select_dtypes(include=['int64', 'float64'])
quantitative_variables = quantitative_variables.iloc[:, 3:]
# Preview the 3 rows
quantitative_variables.head(3)
| accommodates | bedrooms | beds | price | minimum_nights | number_of_reviews | |
|---|---|---|---|---|---|---|
| 0 | 2 | 1 | 1 | 49.0 | 5 | 173 |
| 1 | 2 | 1 | 1 | 95.0 | 3 | 42 |
| 2 | 5 | 3 | 3 | 116.0 | 1 | 228 |
Now, I'm going to calculate and display the Pearson coefficient matrix of the quantitative variables. The correlation matrix is a table that shows the correlation coefficients between each pair of variables.
# Calculate the Pearson coeficient
correlation_matrix = quantitative_variables.corr(method='pearson')
# Exhibit correlation_matriz
correlation_matrix
| accommodates | bedrooms | beds | price | minimum_nights | number_of_reviews | |
|---|---|---|---|---|---|---|
| accommodates | 1.000000 | 0.811764 | 0.861758 | 0.482950 | -0.018598 | 0.022064 |
| bedrooms | 0.811764 | 1.000000 | 0.785499 | 0.470027 | -0.003285 | -0.047187 |
| beds | 0.861758 | 0.785499 | 1.000000 | 0.438041 | -0.005333 | -0.000953 |
| price | 0.482950 | 0.470027 | 0.438041 | 1.000000 | 0.022074 | -0.047232 |
| minimum_nights | -0.018598 | -0.003285 | -0.005333 | 0.022074 | 1.000000 | -0.027315 |
| number_of_reviews | 0.022064 | -0.047187 | -0.000953 | -0.047232 | -0.027315 | 1.000000 |
Here I'm going to create a heatmap of the correlation matrix. A heatmap is a graphical representation of data where the values are shown by different colors. This helps visualize the strength and direction of the correlations between the variables.
# Get correlation matrix
z = correlation_matrix.values.round(2)
# Get variable names
x = correlation_matrix.columns
y = correlation_matrix.index
# Create the heatmap
heatmap = go.Figure(data=go.Heatmap(
z=z, x=x, y=y, colorscale=[(0, '#FFFFFE'), (1, '#ff4c5e')],
hoverongaps = False, hoverinfo='z'
))
# Customize the layout
heatmap.update_layout(
title='<b>Correlation Matrix</b>', title_x=0.5,
width=800, height=500,
hoverlabel=dict(font_size=16, font_family="Rockwell"
))
# Add the source
heatmap.add_annotation(
text='<b>Source: Inside Airbnb</b>', align='right', showarrow=False, xref='paper', yref='paper',
x=1, y=-0.2, bgcolor='white', font=dict(size=12, color='black'
))
# Add annotations (values) to each cell
for i in range(len(y)):
for j in range(len(x)):
# Round the value to 2 decimal places
value = round(z[i][j], 2)
heatmap.add_annotation(dict(
x=x[j], y=y[i], text=str(value),
showarrow=False, font=dict(size=12)
))
# Show the plot
heatmap.show()
# Calculate the correlation
correlation_price_review = data_v4[['price', 'number_of_reviews']].corr().round(2)
# Display the correlation matrix
correlation_price_review
| price | number_of_reviews | |
|---|---|---|
| price | 1.00 | -0.05 |
| number_of_reviews | -0.05 | 1.00 |
The correlation analysis revealed a weak negative correlation of -0.05 between the number of reviews and the price. This suggests that properties with more reviews do not necessarily have higher prices. In fact, the slight negative correlation indicates that properties with more reviews might be priced slightly lower, although the correlation is so weak that it's not a strong relationship.
# Create a scatter plot of 'number_of_reviews' vs 'price' using Plotly graph objects
scatter = go.Scatter(
x=data_v4['number_of_reviews'], y=data_v4['price'], mode='markers',
hovertemplate='Number of Reviews: %{x}<br>Price: %{y}<extra></extra>', marker_color='#f5b7b9'
)
# Customize the layout
layout = go.Layout(
title='<b>Price vs Number of Reviews</b>', title_x=0.5,
xaxis_title='<b>Number of Reviews</b>', yaxis_title='<b>Price</b>',
width=800, height=500,
hoverlabel=dict(bgcolor='white', font_size=16, font_family="Rockwell"
))
# Create the figure and adding the scatter plot
fig = go.Figure(data=scatter, layout=layout)
# Add the source
fig.add_annotation(
text='<b>Source: Inside Airbnb</b>', align='right', showarrow=False, xref='paper', yref='paper',
x=1, y=-0.2, bgcolor='white', font=dict(size=12, color='black'
))
# Show the chart
fig.show()
# Calculate the number of amenities for each listing
num_amenities = data_v4['amenities'].apply(lambda x: len(x))
# Check the correlation using the .corr() method, using as argument the Pearson method
pd.concat([num_amenities, data_v4['price']], axis=1).corr(method='pearson').round(2)
| amenities | price | |
|---|---|---|
| amenities | 1.00 | 0.15 |
| price | 0.15 | 1.00 |
The hypothesis states that listings with more amenities might be priced higher. The answer shows that the correlation coefficient between the number of amenities and the price is 0.15, which indicates a weak positive relationship. That means that there is a slight tendenct for listings with more amenities to have higher prices, but it is not a strong or consistent pattern.
# Create a scatterplot to visualize the relationship using Plotly's graph_objects
fig = go.Figure(data=go.Scatter(
x=num_amenities, y=data_v4['price'], mode='markers',
hovertemplate='Number of Amenities: %{x}<br>Price: %{y}<extra></extra>', marker_color='#f5b7b9'
))
# Customize the layout
fig.update_layout(title='<b>Price vs Number of Amenities</b>', title_x=0.5,
xaxis_title='<b>Number of Amenities</b>', yaxis_title='<b>Price</b>',
width=800, height=500,
hoverlabel=dict(bgcolor='white', font_size=16, font_family="Rockwell"
))
# Add the source
fig.add_annotation(
text='<b>Source: Inside Airbnb</b>', align='right', showarrow=False, xref='paper', yref='paper',
x=1, y=-0.2, bgcolor='white', font=dict(size=12, color='black'
))
# Show the chart
fig.show()
Summary of the analysis:
The distribution of listings among hosts is highly varied, with the top host having nearly five times as many listings as the 10th host. Despite this, the top 10 hosts account for a relatively small percentage of the total listings, suggesting a diverse and broad range of hosts.
The data indicates a pronounced concentration of Airbnb listings in the Melbourne neighbourhood, with a significant drop in the number of listings in other neighbourhoods. This could suggest a higher demand for Airbnb accommodations in the Melbourne neighbourhood compared to others. Further analysis could provide insights into the factors influencing this distribution, such as tourist attractions and accessbility in these neighbourhoods.
In summary, the data suggests that in Melbourne's Airbnb accommodations, guests predominantly prefer the comfort and privacy of renting entire homes/apartments or private rooms, which constitute over 98% of all listings. The less common categories, such as shared rooms and hotel rooms, account for just 1.59% of listings, indicating they're less favored.
The data shows that one and two-bedroom accommodations accounts for 49.33% and 30.01% of the listings respectively. These findings suggest that most Airbnb guests in Melbourne prefer smaller and more affordable accommodations, possibly for solo travelers or couples. Listings with three or four bedrooms are much less prevalent, around 15% of the listings, indicating a lower demand. Lastly, we have 3.24% of the listings with no bedroom, or known as studio, being a more compact and economic option for guests.
In summary, the minimum nights distribuition revelas that most Melbourne Airbnb listings are suitable for short-term stays, while long-term stays are less available and less popular.
The data reveals that the most common amenities in Melbourne’s Airbnb listings are kitchen, smoke alarm, wifi, and essentials, which are considered as basic and necessary for a pleasant and secure stay.
In summary, the room type Entire home/apt in Melbourne’s Airbnb listings has the highest number of reviews, implying that it is the most in-demand and most satisfying type of accommodation. The other room types, such as Shared room, Private room, and Hotel room, have much lower numbers and percentages of reviews, suggesting that they are less available and less attractive to guests.
The data reveals a distinct variation in the average number of amenities across different room types. Specifically, listings categorized as ‘Entire home/apt’ offer the highest average number of amenities, approximately 35.14, whereas ‘Shared room’ listings offer the least, around 18.20. This trend supports our initial hypothesis that room type influences the average number of amenities provided. However, to establish this observation as a statistically significant fact, further analysis, such as an ANOVA test would be required.
In our analysis of room types and their average prices, we tested hypothesis number 9. Our data showed that the average price for a private room $100.22 was actually lower than for a shared room $117.17, indicating that the data did not suporrt hypothesis number 9.
The hypothesis that properties with more reviews might be priced higher due to their popularity was not supported by the data. The correlation analysis showed a weak negative correlation of -0.05 between the number of reviews and the price.
Our analysis examined the correlation between the number of amenities in a listing and its price. We hypothesized that more amenities would correlate with higher prices. The data, however, showed a weak positive correlation of 0.15, indicating only a slight increase in price with more amenities, and not a consistent trend.
Please note that this preliminary analysis is based solely on the provided data and does not consider other potentially influential factors. For a more comprehensive understanding of the situation, it would be advisable to incorporate additional aspects such as geographical location, time of year, and local regulations into the analysis.
Based on our problem statement and questions, we have the following conclusion:
Melbourne's Airbnb market is characterized by its diversity, with a wide range of hosts contributing to a rich variety of listings. The majority of these listings are found in the Melbourne neighbourhood, indicating a high demand in this area.
Guests show a clear preference for renting entire homes or apartments, as well as private rooms, with smaller one or two-bedroom accommodations being particularly popular. These listings are predominantly suitable for short-term stays, catering to the needs of transient visitors.
In terms of amenities, the basics such as a kitchen, smoke alarms, Wi-Fi, and essentials are commonly provided across all listings. Among the various types of accommodations, 'Entire home/apt' listings stand out as the most popular and offer the most amenities.
Interestingly, despite offering more personal space, private rooms are generally cheaper than shared rooms. Furthermore, having more reviews or amenities does not necessarily equate to higher prices, suggesting that other factors may be at play in the pricing decisions.
These findings provide a comprehensive overview of the key characteristics of Airbnb listings in Melbourne. However, they are based on the current data and may change over time.